UPDATED 14:30 EST / MAY 13 2024

Image: The letters "GPT-4o" on an abstract pink and blue background

OpenAI unleashes GPT-4o, a new flagship model with real-time multimodal capabilities

OpenAI upped its artificial intelligence game today with a new flagship AI model named GPT-4o that can respond in real time to text, audio and image inputs, enabling more natural human-computer interaction.

The company says GPT-4o, with the “o” standing for “omni,” is a step toward making talking to an AI model feel more like speaking to or working with another human being. It can respond to voice inputs in an average of 320 milliseconds, which is similar to human response time in conversation. It also matches GPT-4 Turbo in performance on English text, with significant improvements in non-English languages.

“This is the first time that we’re making a huge step forward when it comes to the ease of use,” said Mira Murati, chief technology officer of OpenAI. “Until now, with voice mode, we had three models that come together to deliver this experience. We had transcription, intelligence and then text-to-speech all together in orchestration to deliver voice mode. This also brings a lot of latency to the experience, which breaks the immersion in collaboration with ChatGPT. Now, with GPT-4o, this all happens natively.”
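
For context, the three-stage pipeline Murati describes can be roughly approximated with OpenAI’s public API. The following is a minimal sketch, assuming the “whisper-1” transcription and “tts-1” speech models and hypothetical file names:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Stage 1: transcription -- speech in, text out.
    with open("question.wav", "rb") as audio:
        text_in = client.audio.transcriptions.create(model="whisper-1", file=audio).text

    # Stage 2: "intelligence" -- a text-only model generates the reply.
    reply = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": text_in}],
    ).choices[0].message.content

    # Stage 3: text-to-speech -- text in, audio out. Each hop adds latency.
    client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    ).stream_to_file("reply.mp3")

GPT-4o collapses those three hops into a single natively multimodal model, which is where the latency savings come from.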

The new model will soon be accessible to ChatGPT users for free, as it is rolled out to power the chatbot’s experience under the hood. OpenAI announced a version of ChatGPT that users can access without an account in April, and today the company announced a desktop version for macOS for both free and paid users.

In a demonstration, OpenAI researchers showed onstage how the new model under the hood of ChatGPT is capable of real-time voice conversation, providing the sensation of a real person on the other end of the line with near-instant, emotive responses. The model can also produce a broad range of emotional responses that it incorporates into its voice, including chuckling, the sensation of a “smile” in speech, soft sighs and other verbal cues that people associate with a human speaker.

During the demonstration, OpenAI asked the model to tell a bedtime story and to introduce drama into the tale, at which point the model became more bombastic and grandiose in its tone. It told a bedtime story about a robot, and while it was doing so the presenters continuously asked it to update its tone – right up until they requested that the model tell the story in a “robotic voice” and finish it in a “singsong voice.” The model complied adroitly each time, shifting its tone and even playfully responding with “Initiating dramatic robotic voice.”

The demo also showed that a user can interrupt the model while it’s speaking, meaning there is no need to wait for it to finish a sentence before asking about something else. This ability makes interacting with the model much more like a conversation, where interruptions are sometimes needed just to get a point across.

Since the model is “multimodal,” it’s also able to “see” images and video, which means it can hold conversations about what’s happening on the screen or through the camera. To show off this capability, OpenAI researchers asked the model to watch as a math equation was written on a piece of paper.

The researchers showed it “3x + 1 = 4” and asked the model to help them solve for x without telling them the answer. It then tutored them through the steps of isolating the variable (subtracting 1 from both sides to get 3x = 3, then dividing both sides by 3), arriving at the value x = 1. Throughout the demo, ChatGPT managed to be a patient and thoughtful tutor.
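
The same vision capability is exposed to developers through the chat completions API, which accepts image parts alongside text. A minimal sketch of the equation-tutoring demo, with a hypothetical photo file and an illustrative prompt:

    import base64
    from openai import OpenAI

    client = OpenAI()

    # Encode a photo of the handwritten equation "3x + 1 = 4".
    with open("equation.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Help me solve this for x, but don't tell me the answer."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)  # step-by-step hints rather than the answer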

The ChatGPT app can also be used to help with coding. Even when it cannot see what’s on the screen, it’s possible to copy code and send it to the app; from there, a developer can hold a conversation out loud with the model about the code. It’s also possible to share the entire screen with the model, allowing it to discuss what’s displayed in context.

Another use for GPT-4o’s voice capability within ChatGPT is as a real-time cross-language translator. The model has improved quality and speed in 50 different languages, covering 97% of the world’s population, so a user could ask the model, “Could you translate Italian into English and vice versa for me and my friend?” and it could provide that service. In the OpenAI demonstration, it even added a little personal touch with statements such as, “Your friend asked.”
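
In text form, that interpreter behavior amounts to a simple system prompt. A minimal sketch follows; the prompt wording is an assumption, and the live demo ran end to end in voice:

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a live interpreter between two friends. "
                        "Translate Italian input into English and English input into Italian."},
            {"role": "user", "content": "Ciao! Come sta andando la demo?"},
        ],
    )
    print(resp.choices[0].message.content)  # e.g. "Hi! How is the demo going?"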

Although access to GPT-4o will be free as OpenAI rolls it out in ChatGPT, paid users will still have five times the capacity limits of free users. GPT-4o is also available through the application programming interface for developers, where it’s twice as fast, 50% cheaper and provides five times higher rate limits than the GPT-4 Turbo model.
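
Because GPT-4o sits behind the same chat completions endpoint, switching from GPT-4 Turbo is a one-line model change. A minimal streaming sketch with an illustrative prompt:

    from openai import OpenAI

    client = OpenAI()

    stream = client.chat.completions.create(
        model="gpt-4o",  # swapped in for "gpt-4-turbo"
        messages=[{"role": "user", "content": "Explain what 'omni' means in GPT-4o."}],
        stream=True,  # print tokens as they are generated
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")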

Image: OpenAI
