Microsoft releases Phi-4 language model trained mainly on synthetic data
Microsoft Corp. has developed a small language model that can solve certain math problems better than algorithms several times its size.
The company revealed the model, Phi-4, on Thursday. The algorithm’s performance is notable mainly because of the way it was built: Microsoft trained Phi-4 mostly on synthetic, or machine-generated, data rather than web content as is the usual practice. The model’s math prowess hints that incorporating more synthetic files into small models’ training datasets could be a way to boost their reasoning skills.
Phi-4 is the fourth iteration of an open-source language model series Microsoft introduced last year. Its architecture is nearly identical to that of its predecessor, Phi-3-medium. Both neural networks feature 14 billion parameters and can process prompts with up to 4,000 tokens, units of data that each contain a few characters.
One difference is that Phi-4 features an upgraded tokenizer. This is a component that breaks down user prompts into tokens, which makes the text easier to process.
Microsoft also enhanced Phi-4’s attention mechanism. This is a software component that language models use to find the most important details in a piece of text. The attention mechanism in the previous-generation Phi-3-medium could only consider up to 2,000 tokens’ worth of user input at a time, while Phi-4’s can analyze 4,000.
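To make the difference concrete, here is a minimal Python sketch of a causal attention mask with an optional sliding window. It assumes the 2,000-token limit described above behaves like a sliding window, which is one common way to cap how far back a model looks. The function name `causal_mask` and the toy sizes (8 tokens, a window of 4) are illustrative and are not taken from Microsoft’s code.

```python
def causal_mask(seq_len, window=None):
    """Build a seq_len x seq_len grid of booleans.

    mask[i][j] is True when token i is allowed to attend to token j.
    With window=None every earlier token is visible (full attention).
    With window=k only the k most recent tokens, including i, are visible.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


# Toy stand-ins for the real sizes: 8 tokens of context, window of 4.
full = causal_mask(8)
windowed = causal_mask(8, window=4)

# How many tokens the last position can see under each scheme.
full_visible = sum(full[7])
windowed_visible = sum(windowed[7])
```

In this toy setup the last token sees the whole sequence under full attention but only the most recent half under the window, which mirrors the 4,000 versus 2,000 comparison in the article.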
The main innovation in Phi-4 is the way it was trained. Microsoft trained the model using no fewer than 50 synthetic datasets that collectively contained about 400 billion tokens. Its researchers created the files through a multistep process.
In the first phase, Microsoft collected content from the public web, its existing artificial intelligence training datasets and other sources. The information included, among other things, tens of millions of question-and-answer pairs.
Microsoft removed questions to which it found multiple identical answers online, explaining that such agreement is often a sign a question is too simple. It also removed questions that appeared too complicated, as indicated by available answers that diverged significantly from one another.
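A filter of this kind can be sketched in a few lines of Python. The version below is only an illustration of the idea: it measures how many of a question’s answers match the most common one, then keeps questions that fall between two cutoffs. The function names and the thresholds `low` and `high` are assumptions made for the example; the article does not say what criteria Microsoft actually used.

```python
from collections import Counter


def agreement_ratio(answers):
    """Share of answers that match the single most common answer."""
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def keep_question(answers, low=0.3, high=0.9):
    """Keep a question only if its answers neither all agree nor all diverge.

    `low` and `high` are hypothetical cutoffs chosen for this example.
    """
    ratio = agreement_ratio(answers)
    return low < ratio < high


def filter_questions(qa_pairs, low=0.3, high=0.9):
    """Return the subset of {question: [answers]} that passes the filter."""
    return {q: a for q, a in qa_pairs.items() if keep_question(a, low, high)}


sample = {
    "What is 2 + 2?": ["4", "4", "4", "4"],            # everyone agrees: too simple
    "Open research problem": ["a", "b", "c", "d"],     # no agreement: too hard
    "Medium puzzle": ["12", "12", "15", "12", "9"],    # partial agreement: kept
}
kept = filter_questions(sample)
```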
The company leveraged this initial batch of files as a template from which it generated synthetic data. Microsoft’s researchers used several different methods to produce the synthetic files.
In one phase of the project, the researchers used an AI to rewrite information from the web into test questions. Microsoft then had the AI model generate answers. Lastly, the company instructed the algorithm to analyze its answers and improve them where possible.
In another phase of the project, Microsoft used open-source code as the starting point of the synthetic data generation process. The company entered a code snippet into an AI and asked it to generate a question to which the correct answer is the provided code snippet. This question was then incorporated into the training dataset that Microsoft used to develop Phi-4.
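The reversal step described above can be sketched as follows. This is a hedged illustration, not Microsoft’s pipeline: `build_reversed_pair`, the prompt wording and the stand-in `fake_generate` function are all hypothetical, and a real system would call an actual language model where the stub is used.

```python
def build_reversed_pair(snippet, generate):
    """Turn a code snippet into a question/answer training pair.

    `generate` is any callable that maps a prompt string to model text.
    The snippet itself becomes the answer; the model only writes the question.
    """
    prompt = (
        "Write a programming task whose correct solution is "
        "exactly the following code:\n\n" + snippet
    )
    question = generate(prompt).strip()
    return {"question": question, "answer": snippet}


def fake_generate(prompt):
    # Stand-in for a real language model call, used only for illustration.
    return " Write a function that returns the square of a number. "


snippet = "def square(x):\n    return x * x\n"
pair = build_reversed_pair(snippet, fake_generate)
```

Because the answer is copied from existing code rather than generated, only the question depends on the model’s output quality.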
After creating the initial version of the dataset, Microsoft checked it for accuracy using a set of automated workflows. “We incorporate tests for validating our reasoning-heavy synthetic datasets,” Phi-4’s developers wrote in a research paper. “The synthetic code data is validated through execution loops and tests. For scientific datasets, the questions are extracted from scientific materials.”
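The paper does not spell out how those execution checks work, but the general idea of validating code by running it against tests can be sketched in Python. Everything here is an assumption made for illustration, including the `passes_tests` helper and the sample cases. The sketch also calls `exec` directly, which is unsafe for untrusted code; a production pipeline would presumably run candidates in a sandbox.

```python
def passes_tests(source, func_name, cases):
    """Run `source`, then check the named function against test cases.

    `cases` is a list of (args, expected) tuples. Any error while defining
    or calling the function counts as a failure.
    """
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: not sandboxed
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


cases = [((1, 2), 3), ((0, 0), 0)]

good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"  # syntax error

results = {
    "good": passes_tests(good, "add", cases),
    "wrong": passes_tests(wrong, "add", cases),
    "broken": passes_tests(broken, "add", cases),
}
```

Only samples that pass every case would be kept in the dataset under this scheme.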
After it completed the training process, Microsoft evaluated Phi-4’s output quality across more than a dozen benchmarks. The algorithm outperformed its predecessor across all but one, in some cases by more than 20%.
Notably, Phi-4 also managed to best OpenAI’s GPT-4o and Meta Platforms Inc.’s recently released Llama 3.3 across two benchmarks: GPQA and MATH. The former dataset comprises 448 multiple-choice questions spanning various scientific fields, while MATH consists of competition-style math problems. According to Microsoft, Phi-4 outperformed Llama 3.3 by more than 5% across both tests despite having a fifth as many parameters.
“Phi-4 outperforms comparable and larger models on math related reasoning due to advancements throughout the processes, including the use of high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations,” Ece Kamar, managing director of Microsoft’s AI Frontiers group, wrote in a blog post.
Phi-4 is currently accessible through the company’s Azure AI Foundry service. Microsoft plans to make the code available on Hugging Face next week.