

Google LLC today introduced two new artificial intelligence models, Gemini Robotics and Gemini Robotics-ER, that are optimized to power autonomous machines.
The algorithms are based on the company’s Gemini 2.0 series of large language models. Introduced in December, the LLMs can process not only text but also multimodal data such as video. This latter capability enables the new Gemini Robotics and Gemini Robotics-ER models to analyze footage from a robot’s cameras when making decisions.
Gemini Robotics is described as a vision-language-action model. According to Google, robots equipped with the model can perform complex tasks based on natural language instructions. A user could, for example, ask the AI to fold paper into origami shapes or place items in a Ziploc bag.
Historically, teaching an industrial robot a new task required manual programming. That process requires specialized skills and can consume a significant amount of time. To ease the robot configuration process, Google’s researchers built Gemini Robotics with generality in mind. The company says that the AI can carry out tasks it was not taught to perform during training, which reduces the need for manual programming.
To test how well Gemini Robotics responds to new tasks, Google evaluated it using an AI generalization benchmark. The company determined that the model more than doubled the performance of earlier vision-language-action models. According to Google, Gemini Robotics can not only perform tasks it was not taught to perform but also change how it carries out those tasks when environmental conditions change.
“If an object slips from its grasp, or someone moves an item around, Gemini Robotics quickly replans and carries on — a crucial ability for robots in the real world, where surprises are the norm,” Carolina Parada, head of robotics at Google DeepMind, detailed in a blog post.
The other new AI model that the company debuted today, Gemini Robotics-ER, is geared toward spatial reasoning. This is a term for the complex sequence of computations that a robot must carry out before it can perform a task. Picking up a coffee mug, for example, requires a robotic arm to find the handle and calculate the angle from which it should be approached.
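To make that idea concrete, the kind of geometric calculation involved can be sketched in a few lines of Python. This is an illustrative example only, not Google’s implementation; the plan_approach function, its coordinates and the standoff distance are hypothetical.

```python
# Illustrative sketch of the geometry behind "spatial reasoning":
# given the 2D position of a mug's handle relative to the robot's gripper,
# compute the direction of approach and the distance to travel.
# Simplified example, not Google's implementation.
import math

def plan_approach(gripper_xy, handle_xy, standoff=0.05):
    """Return the approach angle (radians) and travel distance (meters),
    keeping a small standoff so the gripper stops short of the handle."""
    dx = handle_xy[0] - gripper_xy[0]
    dy = handle_xy[1] - gripper_xy[1]
    angle = math.atan2(dy, dx)              # direction from gripper to handle
    distance = math.hypot(dx, dy) - standoff
    return angle, max(distance, 0.0)

angle, distance = plan_approach(gripper_xy=(0.0, 0.0), handle_xy=(0.30, 0.12))
print(f"approach at {math.degrees(angle):.1f} degrees, move {distance:.2f} m")
```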
After developing a plan for how to carry out a task, Gemini Robotics-ER uses Gemini 2.0’s coding capabilities to turn the plan into a configuration script. This script controls the robot in which the AI is installed. If a task proves too complicated for Gemini Robotics-ER, developers can teach it the best course of action with a “handful of human demonstrations.”
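Google has not published the details of that pipeline, but the general shape of a plan-to-script step can be sketched as follows. The RobotArm class and its methods are hypothetical stand-ins for a vendor control API, and the hard-coded plan stands in for code the model would generate.

```python
# Hypothetical sketch of a plan-to-script pipeline: a high-level plan
# (hard-coded here; in Gemini Robotics-ER it would come from the model)
# is rendered into executable calls against a robot control API.
# RobotArm and its methods are illustrative stand-ins, not a real API.

class RobotArm:
    def move_to(self, x, y, z):
        print(f"moving to ({x:.2f}, {y:.2f}, {z:.2f})")

    def grasp(self):
        print("closing gripper")

    def release(self):
        print("opening gripper")

# A plan expressed as structured steps, the kind an LLM could emit as code.
PLAN = [
    ("move_to", {"x": 0.30, "y": 0.12, "z": 0.10}),  # approach the mug handle
    ("grasp", {}),                                    # pick it up
    ("move_to", {"x": 0.00, "y": 0.40, "z": 0.10}),  # carry to the target spot
    ("release", {}),                                  # set it down
]

def execute(plan, robot):
    """Run each planned step by dispatching to the robot's API."""
    for action, kwargs in plan:
        getattr(robot, action)(**kwargs)

execute(PLAN, RobotArm())
```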
“Gemini Robotics-ER can perform all the steps necessary to control a robot right out of the box, including perception, state estimation, spatial understanding, planning and code generation,” Parada wrote. “In such an end-to-end setting the model achieves a 2x-3x success rate compared to Gemini 2.0.”
Google will make Gemini Robotics-ER available to several partners, including Apptronik Inc., a humanoid robot startup that raised $350 million last month. The funding round saw the search giant join as an investor. Google will collaborate with Apptronik to develop humanoid robots equipped with Gemini 2.0.