Google’s PaLM-E combines vision with a ChatGPT-style AI model to power autonomous robots
Researchers from Google LLC and the Technical University of Berlin this week unveiled an artificial intelligence-powered robot controlled by a multimodal embodied visual-language model with 562 billion parameters.
PaLM-E, as the model is called, integrates AI-powered vision and language to enable autonomous robotic control, allowing the robot to perform a wide range of tasks based on human voice commands, without the need for constant retraining. In other words, it’s a robot that can understand what it’s being told to do, then go ahead and carry out those tasks immediately.
For example, if the robot is commanded to “bring me the rice chips from the drawer,” PaLM-E will rapidly create a plan of action based on the command and its field of vision. Then the mobile robot platform it controls, which is equipped with a robotic arm, will execute the action fully autonomously.
PaLM-E works by viewing its immediate surroundings through the robot’s camera, and can do this without any kind of preprocessed scene representation. It simply looks and takes in what it sees, and then works out what it needs to do based on that. That means there’s no need for a human to annotate the visual data first.
Google’s researchers said PaLM-E also can react to changes in the environment as it’s carrying out a task. For instance, if it proceeds to fetch those rice chips, and someone else grabs them from the robot and places them on a table in the room, the robot will see what happened, find the chips, grab them again and bring them to the person who first requested them.
A second example shows how PaLM-E can complete more complex tasks involving sequences, which previously would have required human guidance:
“We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks,” the researchers wrote. “We largely follow the setup in Ahn et al. (2022), where the robot needs to plan a sequence of navigation and manipulation actions based on an instruction by a human. For example, given the instruction ‘I spilled my drink, can you bring me something to clean it up?’, the robot needs to plan a sequence containing ‘1. Find a sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge.’”
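As a rough illustration of how such a plan might be executed, the sketch below shows a hypothetical loop in which a visual-language model proposes the next step from the instruction, the current camera image and the steps completed so far, and a library of low-level skills carries each step out. Every function and object name here is an assumption made for illustration, not Google’s code or API.

```python
# Hypothetical sketch of a plan-and-execute loop in the spirit of the setup
# described above: a visual-language model proposes the next step from the
# instruction, the current camera image and the steps completed so far, and
# a low-level skill carries each step out. All names are placeholders.

def plan_and_execute(instruction, robot, vlm_next_step, skills, max_steps=10):
    history = []                                  # steps already completed
    for _ in range(max_steps):
        image = robot.capture_image()             # current camera view
        step = vlm_next_step(instruction, image, history)
        if step is None:                          # model signals the task is done
            break
        skill_name, target = step                 # e.g. ("pick_up", "sponge")
        skills[skill_name](robot, target)         # run the low-level policy
        history.append(step)
    return history

# Example call (all objects are placeholders):
# plan_and_execute(
#     "I spilled my drink, can you bring me something to clean it up?",
#     robot, vlm_next_step,
#     {"find": find, "pick_up": pick_up, "bring_to": bring_to, "put_down": put_down},
# )
```

Because a loop like this re-queries the model with a fresh camera image on every iteration, a system built this way could replan when the scene changes, which is consistent with the rice-chips example above.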
PaLM-E is based on an existing large language model known as PaLM that’s integrated with sensory information and robotic control, hence the description “embodied visual-language model.” It works by taking continuous observations of its surroundings and encoding them into a sequence of vectors, similar to how it encodes words as “language tokens.” In this way, it can understand sensory information in the same way that it processes voice commands.
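To make that concrete, here is a minimal sketch of the interleaving idea in PyTorch, assuming a simple linear projection that maps vision-transformer patch features into the language model’s embedding space. The dimensions, module names and token IDs below are illustrative placeholders, not the actual PaLM-E architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes only, not PaLM-E's real dimensions.
VIT_DIM, LLM_DIM, VOCAB_SIZE = 1024, 4096, 32000

token_embed = nn.Embedding(VOCAB_SIZE, LLM_DIM)  # ordinary language-token embeddings
img_project = nn.Linear(VIT_DIM, LLM_DIM)        # maps image features into the same space

def build_input_sequence(prefix_ids, image_feats, suffix_ids):
    """Interleave text-token embeddings with projected image features so the
    language model sees one continuous sequence of vectors."""
    prefix = token_embed(prefix_ids)   # (num_prefix_tokens, LLM_DIM)
    image = img_project(image_feats)   # (num_patches,       LLM_DIM)
    suffix = token_embed(suffix_ids)   # (num_suffix_tokens, LLM_DIM)
    return torch.cat([prefix, image, suffix], dim=0)

# Placeholder token IDs and ViT patch features for a single camera observation.
seq = build_input_sequence(torch.tensor([101, 2023]),
                           torch.randn(256, VIT_DIM),
                           torch.tensor([2003, 102]))
print(seq.shape)  # torch.Size([260, 4096])
```

The language model then attends over this mixed sequence just as it would over an all-text prompt, which is what lets a single decoder handle both words and camera observations.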
The researchers added that PaLM-E exhibits a trait known as “positive transfer,” meaning it can carry over knowledge and skills learned from prior tasks to new ones, leading to higher performance than single-task robot models. They said it also displays “multimodal chain-of-thought reasoning,” meaning it can reason over a sequence of inputs that combines language and visual information, as well as “multi-image inference,” in which it uses multiple images as input to make inferences or predictions.
All told, PaLM-E is an impressive breakthrough in autonomous robotics, and Google said its next steps will be to explore additional applications in real-world scenarios such as home automation and industrial robotics. The researchers also expressed hope that their work will inspire more research into multimodal reasoning and embodied AI.
Image: Google