UPDATED 11:05 EST / DECEMBER 16 2025

AI

Allen Institute for AI introduces Molmo 2, bringing open video understanding to AI systems

Building on a foundation of image understanding artificial intelligence models, the Allen Institute for AI today introduced Molmo 2, a multimodal model family adapted for video and multi-image understanding.

In 2024, Ai2 released Molmo, which set a new benchmark for image understanding and became a reference for powerful “pointing” and tagging capabilities. Those models went beyond describing what appeared in an image; they could identify and tag objects with a high degree of confidence.

The Molmo 2 family includes three variants, each designed for different use cases: Molmo 2 8B, Molmo 2 4B and Molmo 2-O 7B. The 8B and 4B models are based on Qwen 3, Alibaba Group Holding Ltd.’s open-weights reasoning models, and provide video grounding and question-answering capabilities. The Molmo 2-O variant is built on Olmo, Ai2’s open-source model family focused on high intelligence and reasoning performance.

According to Ai2, the smaller Molmo 2 models deliver outsized performance relative to their size. The 8B model exceeds the original 72-billion-parameter Molmo model on key image understanding tasks and related benchmarks, setting a new standard for efficiency.

On image and multi-image reasoning, the 4B variant also excels despite its extremely compact size. It outperforms open models such as Qwen 3-VL-8B while being trained on far less data than similar models: roughly 9.19 million videos, compared with 72.5 million for Meta Platforms Inc.’s PerceptionLM.

These smaller sizes allow the models to be deployed efficiently on less hardware, lowering costs and broadening access to these capabilities.

“With Olmo, we set the standard for truly open AI, then last year Molmo ushered the industry toward pointing; Molmo 2 pushes it even further by bringing these capabilities to videos and temporal domains,” said Ai2 Chief Executive Ali Farhadi.

Models such as Molmo 2 form a foundation for assistive and intelligent physical technologies, often referred to as physical AI. These systems perceive, understand and reason about the real world to interact with it meaningfully.

For machines to interact with their environment, they must first understand what they are observing. Humans perform this task intuitively, but machines require AI models that can segment objects, track them over time, tag them consistently and assign expected properties.

Ai2 said Molmo 2 brings capabilities to video understanding that no prior open model has delivered: identifying exactly where and when events occur, tracking multiple objects through complex scenes and connecting actions to frame-level timelines.
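
To make the frame-level grounding idea concrete, here is a hypothetical sketch of how per-frame point annotations might be represented and grouped into per-object timelines. The schema and field names are illustrative assumptions, not Molmo 2’s actual output format.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class FramePoint:
    """One grounded observation: an object pointed to in a specific frame.

    Illustrative schema only; Molmo 2's real output format may differ.
    """
    track_id: int        # identity of the object across frames
    label: str           # e.g. "forklift", "pedestrian"
    timestamp_s: float   # when in the video the observation occurs
    x: float             # normalized horizontal position, 0.0-1.0
    y: float             # normalized vertical position, 0.0-1.0

def group_into_tracks(points: list[FramePoint]) -> dict[int, list[FramePoint]]:
    """Connect per-frame points into per-object timelines, ordered by time."""
    tracks: dict[int, list[FramePoint]] = defaultdict(list)
    for p in points:
        tracks[p.track_id].append(p)
    return {tid: sorted(ps, key=lambda p: p.timestamp_s) for tid, ps in tracks.items()}

# Example: one object tracked across three moments, another seen once.
observations = [
    FramePoint(0, "forklift", 0.0, 0.21, 0.55),
    FramePoint(1, "pedestrian", 0.0, 0.78, 0.60),
    FramePoint(0, "forklift", 1.5, 0.35, 0.54),
    FramePoint(0, "forklift", 3.0, 0.52, 0.53),
]
print(group_into_tracks(observations))
```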

This improved understanding of the physical world is essential for intelligent systems such as traffic cameras, retail item-tracking platforms, safety monitoring systems, autonomous vehicles and robotics. Rapid categorization of objects in a field of view, along with their inherent characteristics, enables machines to reason about what may happen next. This capability is critical not only for interaction but also for safety. Understanding what a robot is observing fundamentally changes how it chooses to respond.

Additionally, Ai2 is releasing a collection of nine new open datasets used to train Molmo 2, totaling more than nine million multimodal examples across dense video captions, long-form QA grounding, tracking and multi-image reasoning. The captioning dataset alone spans more than 1,000 videos with detailed descriptions that average more than 900 words each.

According to the institute, the corpus of datasets provides a mix of video pointing, multi-object tracking, synthetic grounding and long-video reasoning. Combined, it says, they form the foundation for the most complete open video data collection available today.

All models, datasets and evaluation tools are now publicly available on GitHub, Hugging Face and Ai2 Playground for interactive testing. The institute said it will release the training code soon.
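
For readers who want to experiment, below is a minimal sketch of pulling the released artifacts through the standard Hugging Face transformers and datasets APIs. The repository IDs are placeholders rather than confirmed names, and the exact preprocessing and generation interface for Molmo 2 is whatever its model card and bundled code define.

```python
# Minimal sketch, assuming Molmo 2 follows the Hugging Face loading pattern of
# earlier Molmo releases. Repository IDs below are hypothetical placeholders;
# check Ai2's Hugging Face organization for the actual listings.
from transformers import AutoModelForCausalLM, AutoProcessor
from datasets import load_dataset

MODEL_ID = "allenai/Molmo-2-8B"  # hypothetical repo ID for the 8B variant

# Molmo-family checkpoints have shipped custom modeling code, so
# trust_remote_code is typically required; the preprocessing and generation
# interface is defined by that code and documented on the model card.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",  # requires the accelerate package
)

# One of the nine open training datasets (placeholder name) can be streamed
# the same way for inspection without downloading it in full.
captions = load_dataset(
    "allenai/molmo2-video-captions",  # hypothetical dataset repo ID
    split="train",
    streaming=True,
)
print(next(iter(captions)))
```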

Images: geralt/Pixabay, Allen Institute for AI
