

Microsoft Corp. today expanded its Phi line of open-source language models with two new additions optimized for multimodal processing and hardware efficiency.
The first addition is the text-only Phi-4-mini. The second new model, Phi-4-multimodal, is an upgraded version of Phi-4-mini that can also process visual and audio input. Microsoft says that both models significantly outperform comparably sized alternatives at certain tasks.
Phi-4-mini, the text-only model, features 3.8 billion parameters. That makes it compact enough to run on mobile devices. It’s based on the ubiquitous transformer neural network architecture that underpins most LLMs.
A standard transformer model analyzes the text before and after a word to understand its meaning. According to Microsoft, Phi-4-mini is based on a variation of the architecture called a decoder-only transformer, which takes a different approach: it analyzes only the text that precedes a word when determining its meaning. That reduces hardware usage and speeds up processing.
Phi-4-mini also uses a second performance optimization technique called grouped query attention, or GQA, which reduces the hardware usage of the model’s attention mechanism. A language model’s attention mechanism helps it determine which data points are most relevant to a given processing task. With GQA, groups of query heads share a single set of key and value heads, which shrinks the amount of memory the mechanism needs during inference.
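To make the two ideas concrete, here is a minimal PyTorch sketch, not Microsoft’s code: a causal mask restricts each token to the text that precedes it, as in a decoder-only transformer, and query heads share key and value heads in groups, as in GQA. The head counts and dimensions are illustrative assumptions rather than Phi-4-mini’s actual configuration.

```python
# Minimal sketch of causal, grouped query attention; illustrative sizes only.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, num_q_heads, num_kv_heads):
    """x: (batch, seq, dim). Query heads share key/value heads in groups."""
    batch, seq, dim = x.shape
    head_dim = dim // num_q_heads

    # Project inputs; K and V use fewer heads than Q, which is the GQA saving.
    q = (x @ wq).view(batch, seq, num_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(batch, seq, num_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(batch, seq, num_kv_heads, head_dim).transpose(1, 2)

    # Each group of query heads reuses the same key/value head.
    group_size = num_q_heads // num_kv_heads
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    # Causal (decoder-only) masking: each token attends only to earlier tokens.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(batch, seq, dim)

# Example: 8 query heads sharing 2 key/value heads (hypothetical sizes).
dim, n_q, n_kv = 64, 8, 2
x = torch.randn(1, 10, dim)
wq = torch.randn(dim, dim)
wk = torch.randn(dim, dim * n_kv // n_q)
wv = torch.randn(dim, dim * n_kv // n_q)
print(grouped_query_attention(x, wq, wk, wv, n_q, n_kv).shape)  # (1, 10, 64)
```

Because the key and value projections are smaller, the cache a decoder keeps during generation shrinks accordingly, which is where most of the memory savings come from.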
Phi-4-mini can generate text, translate existing documents and take actions in external applications. According to Microsoft, it’s particularly adept at math and coding tasks that require “complex reasoning.” In a series of internal benchmark tests, the company determined that Phi-4-mini completes such tasks with “significantly” better accuracy than several similarly sized language models.
The second new model that Microsoft released today, Phi-4-multimodal, is an upgraded version of Phi-4-mini with 5.6 billion parameters. It can process not only text but also images, audio and video. Microsoft trained the model using a new technique it dubs Mixture of LoRAs.
Adapting an AI model to a new task usually requires changing its weights, the configuration settings that determine how it crunches data. This process can be costly and time-consuming. As a result, researchers often use a different approach known as low-rank adaptation, or LoRA. Instead of modifying existing weights, LoRA teaches a model to perform an unfamiliar task by adding a small number of new weights optimized for that task.
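The snippet below is a minimal illustration of the general LoRA idea, not code from the Phi project: the pretrained weight matrix stays frozen, and only a pair of small low-rank matrices is trained for the new task.

```python
# LoRA sketch: freeze the base layer, learn a small low-rank correction.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights are left untouched
        # Only these two small matrices are trained for the new task.
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        # Output = frozen base projection + low-rank update.
        return self.base(x) + (x @ self.lora_a) @ self.lora_b

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8,192 trainable values vs. 262,656 in the frozen layer
print(layer(torch.randn(2, 512)).shape)  # (2, 512)
```

In this toy example, the adapter adds about 3% as many parameters as the frozen layer it sits on top of, which is why the approach is cheaper than retraining the original weights.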
Microsoft’s Mixture of LoRAs method applies the same concept to multimodal processing. To create Phi-4-multimodal, the company extended Phi-4-mini with adapter weights optimized to process audio and visual data. According to Microsoft, the technique mitigates some of the tradeoffs associated with other approaches to building multimodal models.
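How the adapters are wired together inside Phi-4-multimodal isn’t detailed here, but the sketch below shows one way the Mixture-of-LoRAs concept can be expressed: a frozen base layer paired with separate low-rank adapters for text, vision and audio, with the active adapter chosen by the input’s modality. All names and sizes are illustrative assumptions, not Microsoft’s implementation.

```python
# Hedged sketch of the Mixture-of-LoRAs concept: one frozen backbone layer,
# one small low-rank adapter per modality.
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    def __init__(self, dim: int = 512, rank: int = 8,
                 modalities=("text", "vision", "audio")):
        super().__init__()
        self.base = nn.Linear(dim, dim)   # stands in for a frozen language-model layer
        self.base.requires_grad_(False)
        # One adapter pair per modality; only these are trained for the new inputs.
        self.lora_a = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(dim, rank) * 0.01) for m in modalities})
        self.lora_b = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(rank, dim)) for m in modalities})

    def forward(self, x, modality: str):
        # The same frozen backbone runs for every modality; only the low-rank
        # correction differs, so adding a modality leaves the text weights intact.
        return self.base(x) + (x @ self.lora_a[modality]) @ self.lora_b[modality]

layer = MixtureOfLoRAs()
text_features = torch.randn(1, 512)
audio_features = torch.randn(1, 512)  # assumes an upstream audio encoder produced these
print(layer(text_features, "text").shape, layer(audio_features, "audio").shape)
```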
The company tested Phi-4-multimodal’s capabilities using more than a half-dozen visual data processing benchmarks. The model achieved an average score of 72, trailing OpenAI’s GPT-4 by less than one point. Google LLC’s Gemini 2.0 Flash, a cutting-edge large language model that debuted in December, scored 74.3.
Phi-4-multimodal achieved even better results in a set of benchmark tests that involved both visual and audio input. According to Microsoft, the model outperformed Gemini 2.0 Flash “by a large margin.” Phi-4-multimodal also bested InternOmni, an open-source LLM that is built specifically to process multimodal data and has a higher parameter count.
Microsoft will make Phi-4-multimodal and Phi-4-mini available on Hugging Face under an MIT license, which permits commercial use.
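For readers who want to try the models, the snippet below shows a typical Hugging Face transformers workflow. The repository ID is an assumption based on Microsoft’s naming pattern, so check the company’s Hugging Face page for the exact name before running it.

```python
# Hedged usage sketch with the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed repository name; verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Solve step by step: what is 12 * 17?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```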