Google enhances Gemini Pro with more natural conversational abilities and improved understanding
Google said today it’s enhancing its Gemini family of generative artificial intelligence models, giving them more natural and intuitive chat capabilities so that interacting with them feels much more like having a conversation with another human.
The updates to Gemini were revealed onstage at Google I/O 2024, the company’s annual developer conference near its Mountain View, California headquarters, where Google showed how the chattier Gemini experience can be used to handle complex tasks within various Google applications.
In addition to the conversational capabilities, Google also revealed it’s adding a larger context window to Gemini 1.5 Pro, its most powerful model, to improve its understanding. With the update, the context window expands to 1 million tokens, the fundamental units of data the model processes, allowing it to make sense of multiple large documents totaling up to 1,500 pages, or to summarize the content of 100 emails. Because Gemini 1.5 Pro is a multimodal model, the larger context window also means it will be able to digest and make sense of an hour-long video or up to 30,000 lines of code.
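For a rough sense of what those numbers mean, the back-of-envelope sketch below estimates what fits in a 1 million-token window. The per-page, per-email and per-line token rates are illustrative assumptions, not figures Google has published:

```python
# Back-of-envelope estimates of what fits in a 1M-token context window.
# All per-unit token rates below are rough assumptions for illustration.
CONTEXT_WINDOW = 1_000_000

TOKENS_PER_PAGE = 650        # assumed: ~500 words/page at ~1.3 tokens/word
TOKENS_PER_EMAIL = 10_000    # assumed: long threads with quoted history
TOKENS_PER_CODE_LINE = 30    # assumed: average line of source code

print(f"Pages of documents: ~{CONTEXT_WINDOW // TOKENS_PER_PAGE:,}")
print(f"Emails:             ~{CONTEXT_WINDOW // TOKENS_PER_EMAIL:,}")
print(f"Lines of code:      ~{CONTEXT_WINDOW // TOKENS_PER_CODE_LINE:,}")
```

With those assumed rates, the window works out to roughly 1,500 pages, 100 emails or 30,000 lines of code, in line with the figures Google cites.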
Chattier than ever
Gemini’s new conversational capabilities appear to be an almost instant response to the real-time, multimodal features of OpenAI’s most powerful large language model, GPT-4o, which was unveiled at a special event on Monday, though of course Google has been working on its own features for some time.
GPT-4o upped the stakes for generative AI models with its ability to respond almost instantly to text, audio and image inputs, bringing AI conversations a step closer to the experience of speaking with another human being. OpenAI claimed an average response time of just 320 milliseconds, similar to human conversational response times, while also allowing users to interrupt the model mid-response and have it adjust on the fly, much as a human speaker would.
With Gemini Live, a new service that will become available to Gemini Advanced subscribers in the coming months, Google is attempting to deliver a similar experience. Users will be able to choose from a range of different voices they want to interact with and speak to it about almost any kind of topic. They’ll be able to speak at their own pace, and even interrupt the model mid-response with a clarifying question, just as they might do when talking to a human.
In a blog post, Sissie Hsiao, Google’s vice president and general manager of Gemini Experiences and Google Assistant, described various situations where these enhanced conversational abilities might be useful. For instance, someone getting ready for a job interview or rehearsing a speech can go to Gemini Live and ask it to help them prepare. It will immediately respond with suggested skills the user might want to highlight to an employer, based on the context of the interview, or offer tips to help calm nerves before stepping up in front of an audience.
Other use cases include making complex plans, such as a family outing. For example, someone might explain that they’re going to Miami with their family to celebrate Labor Day, and point out that their husband wants to eat fresh seafood, while their son really loves art. Gemini Live will be able to pull the user’s hotel and flight booking information from Gmail and then respond with an appropriate plan.
Hsiao explained that in order to respond to such a request, Gemini needs to do much more than just pull information from the web, like other chatbots do.
“Gemini takes into account your flight timing, meal preferences and information about local museums, while also understanding where each stop is located and how long it will take to travel between each activity,” Hsiao said. “It grabs your flight information from Gmail, taps Google Maps for restaurant and museum recommendations near your hotel, and uses Search to recommend other activities, like a walking tour of the Design District or beach time, to fill out the rest of your day.”
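Google hasn’t disclosed how this orchestration works under the hood, but the behavior Hsiao describes resembles the tool-calling pattern common to agentic systems: the model grounds its plan in the user’s own data, then calls out to services to fill in the details. The sketch below is a hypothetical illustration; the function names and stub data are assumptions, not Gemini’s actual interfaces:

```python
# Hypothetical sketch of the tool-calling pattern described above.
# None of these functions are real Gemini APIs; they stand in for the
# kinds of data sources Hsiao describes (Gmail, Maps, Search).

def get_flight_from_gmail(user: str) -> dict:
    """Stub: would pull the flight confirmation from the user's inbox."""
    return {"arrive": "2024-08-30 11:05", "depart": "2024-09-02 18:40"}

def find_places_near(hotel: str, category: str) -> list[str]:
    """Stub: would query a maps service for venues near the hotel."""
    return [f"{category} option near {hotel}"]

def plan_trip(user: str, hotel: str, preferences: list[str]) -> list[str]:
    # 1. Ground the plan in the user's own data (flight times).
    flight = get_flight_from_gmail(user)
    itinerary = [f"Arrive {flight['arrive']}"]
    # 2. Fill each preference with a nearby venue; travel-time logic
    #    between stops is omitted here for brevity.
    for pref in preferences:
        itinerary.extend(find_places_near(hotel, pref))
    itinerary.append(f"Depart {flight['depart']}")
    return itinerary

print(plan_trip("demo@example.com", "South Beach hotel",
                ["fresh seafood", "art museum"]))
```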
Larger context windows
It’s not just the conversational capabilities of Gemini that are being enhanced, but also its capacity to understand user inputs. Gemini 1.5 Pro is being made available to Gemini Advanced subscribers with a much greater context window that starts at 1 million tokens, a number that Google claims is far larger than that of any other consumer chatbot currently available.
This will give Gemini 1.5 Pro the ability to draw on far more context in its responses, based on whatever documents and files the user feeds into it. As an example, Hsiao said people might connect Gemini 1.5 Pro to their files within Google Drive and ask it questions based on that content, such as the details of the pet policy in their rental agreement, or the key arguments made in multiple research papers. It will also be able to act as a data analyst, delivering insights and building visualizations based on uploaded files and spreadsheets.
Because Gemini 1.5 Pro is multimodal, users can upload more than just text. For instance, they’ll be able to take a photo of whatever they’re eating and have it generate a recipe to recreate the same dish at home.
Another benefit of a Gemini Advanced subscription will be Gems, customized versions of Gemini focused on very specific topics. Users will be able to create Gems such as a gym buddy that provides tips and feedback on exercise routines, a sous chef that advises them on how to cook any meal, a coding partner that suggests lines of code, or a creative writing guide that helps them get their thoughts down on paper.
They simply have to explain what they want their Gem to do, then set the tone of its response. So for instance, a running coach can be set up to respond in a positive, upbeat and motivational way, Google said.
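Google hasn’t published the format behind Gems, but the setup described, a task description plus a tone, maps naturally onto a system-prompt configuration. A minimal hypothetical sketch:

```python
# Hypothetical representation of a Gem: a task description plus a tone,
# compiled into a system prompt. This is illustrative only; Google has
# not published the internal format Gems actually use.
from dataclasses import dataclass

@dataclass
class Gem:
    name: str
    task: str
    tone: str

    def system_prompt(self) -> str:
        return (f"You are '{self.name}'. {self.task} "
                f"Always respond in a {self.tone} tone.")

running_coach = Gem(
    name="Running coach",
    task="Build weekly training plans and give feedback on runs.",
    tone="positive, upbeat and motivational",
)
print(running_coach.system_prompt())
```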
Project Astra: universal AI agents
Customized chatbots are a growing trend in the generative AI industry, but there’s also a need for more knowledgeable, universal AI agents that can be helpful in all manner of different situations. Recognizing this, Google has created an initiative called Project Astra, which it describes as the “future of AI assistants.”
“To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do — and take in and remember what it sees and hears to understand context and take action,” said Google DeepMind Chief Executive Demis Hassabis. “It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay.”
According to Hassabis, the goal of Project Astra is to create more natural sounding conversational AI assistants, which involves enhancing and speeding up their ability to perceive and reason based on the information they’re presented with. His team’s approach to doing this involves “continuously encoding video frames, combining the video and speech input into a timeline of events, and caching this information for efficient recall.”
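DeepMind hasn’t released its implementation, but Hassabis’ description suggests a pipeline that merges perceptual streams into a single, queryable timeline. Here’s a minimal conceptual sketch, with placeholder encoders and a simple cache standing in for the real components:

```python
# Conceptual sketch of the Astra-style perception loop Hassabis describes:
# encode incoming video frames, merge them with speech into one event
# timeline, and cache the result for fast recall. The encoding and recall
# below are placeholders, not DeepMind's implementation.
import time
from collections import deque

class EventTimeline:
    def __init__(self, max_events: int = 10_000):
        # Bounded cache so recall stays fast as the session grows.
        self.events = deque(maxlen=max_events)

    def add(self, kind: str, payload: str) -> None:
        self.events.append((time.time(), kind, payload))

    def recall(self, query: str) -> list[tuple]:
        # Placeholder recall: substring match instead of a learned retriever.
        return [e for e in self.events if query in e[2]]

timeline = EventTimeline()
timeline.add("video", "frame: a pair of glasses on the desk")
timeline.add("speech", "user asks: where did I leave my glasses?")
print(timeline.recall("glasses"))
```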
At the same time, Google has been working to enhance the sound of its universal AI agents, giving them a wider range of intonations so they can respond in a more natural and conversational way.
Expanded model family
Elsewhere on the generative AI front, Google announced a host of less powerful models that it says are customized for different applications and scenarios, including a more lightweight version of Gemini called Gemini 1.5 Flash, an updated Gemini Nano for on-device workloads, and its next-generation image and video generation models.
Gemini 1.5 Flash is designed for applications that require lower latency and lower serving costs. Although it’s significantly smaller than Gemini 1.5 Pro, it has been optimized for high-volume, high-frequency tasks at scale and features the same 1 million-token context window. It was trained using a process known as “distillation,” in which the most essential skills from the larger Gemini 1.5 Pro were transferred directly to the smaller model, which is said to excel at summarization, chat applications, image and video captioning, and data extraction.
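For readers unfamiliar with distillation, here’s a minimal sketch of the general technique: a small “student” model is trained to match the output distribution of a larger “teacher.” The toy models and hyperparameters are illustrative; this is the generic method, not Google’s training recipe:

```python
# Minimal sketch of knowledge distillation: a small "student" learns to
# match the softened output distribution of a large "teacher". The toy
# linear models below are placeholders, not Gemini.
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(16, 4)   # stands in for the large model
student = torch.nn.Linear(16, 4)   # stands in for the small, fast model
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature softens the teacher's distribution

for _ in range(100):
    x = torch.randn(32, 16)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / T, dim=-1)
    student_logp = F.log_softmax(student(x) / T, dim=-1)
    # KL divergence pulls the student's predictions toward the teacher's.
    loss = F.kl_div(student_logp, teacher_probs,
                    reduction="batchmean") * T**2
    opt.zero_grad()
    loss.backward()
    opt.step()
```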
Gemini Nano is a special version of the Gemini model that’s designed to run directly on devices such as smartphones and tablets, processing prompts and requests locally instead of sending them to the cloud. Launched last year, it’s now being updated to handle image inputs in addition to text and audio, so it can experience the world via sight, sound and spoken language.
There’s also a new version of Google’s open-source Gemma model, which is built on the same research and technology behind the Gemini family. Gemma 2 features a new architecture that enhances its performance and efficiency, and it will be made available in various sizes. Alongside it, Google introduced PaliGemma, a new vision-language model in the Gemma family.
Meanwhile, Veo represents Google’s best effort so far in the nascent field of generative video. In a blog post, Eli Collins, vice president of product management, explained that Veo can generate high-quality 1080p-resolution videos longer than a minute, with support for a range of cinematic and visual styles. Besides understanding natural language, it can also grasp visual semantics to accurately render users’ ideas.
Veo also understands cinematic terms such as “timelapse” and “aerial shot of a landscape,” giving users greater control and helping it deliver “consistent and coherent” footage of people, animals and objects with realistic movements, Collins said. It’s currently available to select creators as a private preview inside VideoFX, with a waitlist for anyone else interested in seeing what it can do.
Finally, Google announced the latest version of its flagship text-to-image model, Imagen 3, bringing improved quality and fidelity to AI-generated images. According to Collins, it can generate much greater detail than Imagen 2, producing more lifelike, photorealistic images with fewer distracting visual artifacts.
It also has superior natural language understanding, enabling it to better incorporate small details from longer prompts. It’s being made available in private preview within ImageFX for select creators starting today, with everyone else invited to sign up for a waitlist.
Image: Google/YouTube