UPDATED 20:21 EST / MARCH 04 2026

AI

Microsoft open-sources multimodal reasoning model with 15B parameters

Microsoft Corp. today released a hardware-efficient reasoning model, Phi-4-reasoning-vision-15B, that can process multimodal files such as scientific charts.

The model is based on two existing algorithms called SigLIP-2 and Phi-4 Reasoning. SigLIP-2 compresses images into a numerical form that neural networks can understand. Phi-4 Reasoning, in turn, is a reasoning model that Microsoft open-sourced last April.

The company’s researchers combined the two algorithms using an approach known as mid-fusion. 

An artificial intelligence model comprises collections of artificial neurons called layers. Engineers can equip all of a model’s layers with the ability to process multimodal data. In mid-fusion models such as Phi-4-reasoning-vision-15B, only some of the layers support multimodal processing. That arrangement trades off some output quality for a significant reduction in hardware use.
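The mid-fusion idea can be sketched in a few lines of toy code. This is purely illustrative: the layer count, dimensions and the simple averaging "fusion" below are assumptions for demonstration, not Phi-4-reasoning-vision-15B's actual architecture, which would use cross-attention inside full transformer layers.

```python
# Toy sketch of mid-fusion: only designated middle layers mix in image
# features; the rest run as cheap text-only computation.
import numpy as np

rng = np.random.default_rng(0)

N_LAYERS = 8
FUSION_LAYERS = {3, 4}   # hypothetical choice of "middle" layers
DIM = 16

def layer(x, fuse_with=None):
    """One toy layer: a random linear map, optionally mixing in image features."""
    w = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
    h = np.tanh(x @ w)
    if fuse_with is not None:
        h = 0.5 * (h + fuse_with)   # crude stand-in for cross-attention
    return h

def forward(text_tokens, image_features):
    h = text_tokens
    for i in range(N_LAYERS):
        h = layer(h, image_features if i in FUSION_LAYERS else None)
    return h

text = rng.standard_normal((4, DIM))    # 4 text tokens
image = rng.standard_normal((4, DIM))   # pooled image features, same shape for simplicity
out = forward(text, image)
print(out.shape)  # (4, 16)
```

Because six of the eight toy layers never touch the image features, the multimodal machinery runs in only a quarter of the network, which is where the hardware savings in a mid-fusion design come from.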

According to Microsoft, users can further lower the model’s infrastructure footprint by disabling its reasoning feature. The capability can be turned on and off with prompts.
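A prompt-level toggle of this kind might look like the sketch below. The directive strings and chat-template tokens here are assumptions for illustration; the model card would define the actual syntax.

```python
# Hypothetical sketch of switching reasoning on and off via the prompt.
# The system directives and <|system|>/<|user|> delimiters are assumptions,
# not the model's documented control syntax.
def build_prompt(question: str, reasoning: bool = True) -> str:
    if reasoning:
        system = "Think step by step before answering."
    else:
        # Skipping the chain-of-thought is where the infrastructure savings
        # come from: the model generates far fewer tokens per response.
        system = "Answer directly without showing your reasoning."
    return f"<|system|>{system}<|end|><|user|>{question}<|end|><|assistant|>"

print(build_prompt("What is 2 + 2?", reasoning=False))
```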

The company mainly trained Phi-4-reasoning-vision-15B on open-source data. The data included images and text-based descriptions of the objects depicted in those images. Before it started training the model, Microsoft refined the files through a multistep process.

First, the company identified high-quality datasets that didn’t require changes and set them aside. It then searched for file collections that comprised high-quality images with inaccurate captions. Microsoft’s researchers generated new captions for those images using GPT-4o and o4-mini.
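The caption-repair step can be sketched as a simple keep-or-regenerate loop. The quality check and the `recaption()` placeholder below are hypothetical; in Microsoft's pipeline the recaptioning was done by GPT-4o and o4-mini, which would take the placeholder's role.

```python
# Sketch of the dataset refinement loop described above (assumptions noted).
def caption_is_accurate(image_id: str, caption: str) -> bool:
    # Placeholder heuristic; a real pipeline would use a model-based scorer.
    return len(caption.split()) >= 3

def recaption(image_id: str) -> str:
    # Stand-in for a call to a captioning model such as GPT-4o.
    return f"auto-generated caption for {image_id}"

def refine(dataset):
    refined = []
    for image_id, caption in dataset:
        if caption_is_accurate(image_id, caption):
            refined.append((image_id, caption))              # keep good pairs as-is
        else:
            refined.append((image_id, recaption(image_id)))  # regenerate bad captions
    return refined

data = [("img_001", "a bar chart of quarterly revenue"), ("img_002", "chart")]
print(refine(data))
```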

The company enriched the refined open-source files with internally created training data and “high-quality data from targeted acquisitions.” It also added examples of behaviors the model should avoid, a dataset that helps Phi-4-reasoning-vision-15B steer clear of harmful output.


Microsoft compared the algorithm to several similarly sized reasoning models using a set of open-source benchmarks. Phi-4-reasoning-vision-15B scored 17% higher than Google LLC’s gemma-3-12b-it on MathVista_Mini, a benchmark that comprises multimodal math questions. The model also achieved higher scores across more than a half-dozen other evaluations. 

“We have competitive performance to much slower models that require ten times or more compute-time and tokens and better accuracy than similarly fast models, particularly when it comes to math and science reasoning,” Microsoft researchers wrote in a blog post today.

Developers can use Phi-4-reasoning-vision-15B to build AI agents that interact with applications via their user interfaces. The model is capable of deducing the function of different interface elements based on screenshots.
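An agent built on such a model would typically parse the model's grounding output into clickable coordinates. The output format below ("kind "label": x1,y1,x2,y2" per line) is an assumption for illustration; the model's real grounding syntax would be defined in its documentation.

```python
# Hypothetical sketch: turning grounding output into click targets.
import re

SAMPLE_OUTPUT = """\
button "Submit": 412,530,508,566
text_field "Email": 120,300,480,336
"""

def parse_elements(text: str):
    pattern = re.compile(r'(\w+) "([^"]+)": (\d+),(\d+),(\d+),(\d+)')
    elements = []
    for m in pattern.finditer(text):
        kind, label, x1, y1, x2, y2 = m.groups()
        elements.append({
            "kind": kind,
            "label": label,
            "box": tuple(int(v) for v in (x1, y1, x2, y2)),
        })
    return elements

def click_point(element):
    # An agent would click the center of the bounding box.
    x1, y1, x2, y2 = element["box"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

elems = parse_elements(SAMPLE_OUTPUT)
print(elems[0]["label"], click_point(elems[0]))  # Submit (460, 548)
```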

“With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling option as a base-model for training agentic models such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields,” the researchers explained.

The model can also analyze more complicated visual assets such as scientific charts. In a demo shared by Microsoft, a user uploaded a photo of Saturn and asked Phi-4-reasoning-vision-15B why the planet appears tilted. It explained that Saturn’s orientation depends on the time of year and the position of the telescope that took the photo.

Microsoft has made the model’s code available on Hugging Face, GitHub and Azure.

Photo: Pixabay
