Seattle-based artificial intelligence research institute Ai2, the Allen Institute for AI, today announced the release of MolmoAct 7B, a breakthrough open embodied AI model that brings intelligence to robots by allowing them to “think” through actions before performing them.
Spatial reasoning isn’t new for AI models, which can reason about the world by analyzing images or video and drawing conclusions about them. For example, a user can upload an image or video to OpenAI’s ChatGPT, ask how to assemble a desk and receive an answer. Similarly, robotics AI foundation models can be told to pick up a cup and place it in the sink.
“Embodied AI needs a new foundation that prioritizes reasoning, transparency and openness,” said Chief Executive Ali Farhadi. “With MolmoAct, we’re not just releasing a model; we’re laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world.”
Most robotics AI models operate by reasoning about the language provided to them, breaking down natural language sentences — such as the example above, “Pick up the cup on the counter and put it in the sink” — and turning them into actions. They do this by combining data from cameras and other sensors with the command.
Ai2 said MolmoAct is the first in a new category of AI models the company calls an action reasoning model, or ARM, which interprets high-level natural language instructions and then reasons through a plan of physical actions to carry them out in the real world. Unlike current robotics models on the market that operate as vision-language-action, or VLA, foundation models, ARMs break down instructions into a series of waypoints and actions that take into account what the model can see.

“As soon as it sees the world, it lifts the entire world into 3D and then it draws a trajectory to define how its arms are going to move in that space,” Ranjay Krishna, the computer vision team lead at Ai2, told SiliconANGLE in an interview. “So, it plans for the future. And after it’s done planning, only then does it start taking actions and moving its joints.”
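In code, that plan-before-acting loop might look something like the minimal Python sketch below. The class and method names (perceive, plan, move_to and so on) are hypothetical illustrations of the idea Krishna describes, not Ai2’s actual MolmoAct interface.

```python
# Minimal sketch of the "plan before acting" loop described above.
# All class, function and method names are hypothetical illustrations,
# not Ai2's actual MolmoAct API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Waypoint:
    """A target pose for the end effector in the robot's 3D workspace."""
    xyz: Tuple[float, float, float]
    gripper_open: bool


def plan_then_act(image, instruction: str, model, robot) -> None:
    """Reason through a full trajectory before moving any joints."""
    # 1. Lift the 2D camera view into a 3D understanding of the scene.
    scene_3d = model.perceive(image)

    # 2. Draw a trajectory: an ordered list of waypoints toward the goal.
    waypoints: List[Waypoint] = model.plan(scene_3d, instruction)

    # 3. Only after planning does the robot start moving its joints.
    for wp in waypoints:
        robot.move_to(wp.xyz)
        robot.set_gripper(open=wp.gripper_open)
```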
Both ARM and VLA models act as “brains” for robots. Examples include pi-zero from robotics AI startup Physical Intelligence, Nvidia Corp.’s GR00T N1 for humanoid robots, OpenVLA, a 7 billion-parameter open-source model commonly used by academic researchers for experiments, and Octo, a 93 million-parameter model. Parameters refer to the number of internal variables a model uses to make decisions and predictions; MolmoAct contains 7 billion parameters, hence the 7B in its name.
The company trained the model on 18 million samples using a cluster of 256 Nvidia H100 graphics processing units, finishing pre-training in about a day. Fine-tuning took about two hours on 64 H100s. By comparison, Nvidia’s GR00T-N2-2B was trained on 600 million samples with 1,024 H100s, while Physical Intelligence trained pi-zero on 900 million samples and an undisclosed number of chips.
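For a rough sense of scale, the reported figures imply the GPU-hours below. This is back-of-the-envelope arithmetic from the numbers above, not a figure Ai2 has published.

```python
# Back-of-the-envelope GPU-hours implied by the reported training figures.
pretrain_gpu_hours = 256 * 24  # 256 H100s for about a day ~= 6,144 GPU-hours
finetune_gpu_hours = 64 * 2    # 64 H100s for about two hours = 128 GPU-hours
print(pretrain_gpu_hours, finetune_gpu_hours)
```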
“A lot of these companies give you these tech reports, but these tech reports kind of look like this: They have this big black box in the middle that says, ‘transformer,’ right? And beyond that, you really don’t know what’s going on,” said Krishna.
Unlike many current models on the market, MolmoAct 7B was trained on a curated open dataset of around 12,000 “robot episodes” recorded in real-world environments such as kitchens and bedrooms. These demonstrations were used to map instructions to goal-oriented actions such as arranging pillows and putting away laundry.
Krishna explained that MolmoAct overcomes this industry transparency challenge by being fully open: its code, weights and evaluations are all publicly available, resolving the “black box” problem. The model is trained on open data, and its inner workings are transparent and openly available.
To add even more control, users can preview the model’s planned movements before execution, with its intended motion trajectories overlaid on camera images. These plans can be modified using natural language or by sketching corrections on a touchscreen.
This provides a fine-grained method for developers or robotics technicians to control robots in different settings such as homes, hospitals and warehouses.
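As a rough illustration of the preview idea, the snippet below overlays a planned pixel-space trajectory on a camera frame so it can be inspected before execution. The frame, waypoints and plotting approach are assumptions for demonstration, not MolmoAct’s actual visualization tooling.

```python
# Illustrative sketch: draw a planned trajectory over a camera image so a
# technician can inspect the motion before the robot executes it.
# The frame and waypoints below are made-up demonstration values.
import matplotlib.pyplot as plt
import numpy as np


def overlay_trajectory(image: np.ndarray, pixel_waypoints: np.ndarray) -> None:
    """Draw the planned 2D trajectory (N x 2 pixel coordinates) on the image."""
    plt.imshow(image)
    plt.plot(pixel_waypoints[:, 0], pixel_waypoints[:, 1], "r-o", linewidth=2)
    plt.title("Planned motion (preview before execution)")
    plt.axis("off")
    plt.show()


# Example: a dummy camera frame and a hand-drawn path from counter to sink.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
path = np.array([[120, 400], [250, 300], [400, 250], [520, 180]])
overlay_trajectory(frame, path)
```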
Ai2 said it evaluated MolmoAct’s pre-training capabilities using SimPLER, a benchmark that uses a set of simulated test environments for common real-world robot setups. On the benchmark, the model achieved a state-of-the-art task success rate of 72.1%, beating models from Physical Intelligence, Google LLC, Microsoft Corp. and Nvidia.
“MolmoAct is our first sort of foray into this space showing that reasoning models are the right way of going for training these large-scale foundation models for robotics,” said Krishna. “Our mission is to enable real world applications, so anybody out there can download our model and then fine tune it for any sort of purposes that they have, or try using it out of the box.”
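For readers who want to experiment, loading an open model like this typically follows the standard Hugging Face pattern sketched below. The repository identifier is an assumption; check Ai2’s release page for the published name.

```python
# Minimal sketch of pulling open weights for local experimentation.
# The repository ID below is assumed, not verified against Ai2's release.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "allenai/MolmoAct-7B"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",  # spread the 7 billion parameters across available GPUs
)
```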