UPDATED 13:15 EDT / JUNE 24 2022


Meta is building better AI-driven audio for virtual reality

When it comes to virtual reality, creating immersive worlds is more than just generating visually perfect environments. The way that sound works can make or break an experience.

To tackle the audio challenge, researchers at Meta Platforms Inc. today open-sourced three artificial intelligence models that take sound in the metaverse to a new level.

“Getting spatial audio right is key to delivering a realistic sense of presence in the metaverse,” said Mark Zuckerberg, founder and chief executive of Meta. “If you’re at a concert, or just talking with friends around a virtual table, a realistic sense of where sound is coming from makes you feel like you’re actually there.”

Sound behaves differently in different environments. Everyone knows the experience of singing in an enclosed space such as a shower, which is entirely different from talking in an open park. There’s also the way friends’ voices reflect off the walls of a living room, or the low murmur that fills a restaurant.

This is the essence of the first model, called the Visual Acoustic Matching model, which uses an image of the space to adjust sounds so that they match the target environment. For example, it could take an audio clip of a person speaking in an open field and match it to someplace cozy and intimate, making the voice sound closer and echo off nearby walls.
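Though Meta’s model learns this transformation from an image, the underlying acoustic idea is classical: a “dry” recording can be made to sound as if it were captured in a given room by convolving it with that room’s impulse response. The sketch below illustrates only that classical idea with entirely synthetic signals; it is not Meta’s method.

```python
import numpy as np

def apply_room_acoustics(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a 'dry' signal with a room impulse response (RIR)."""
    wet = np.convolve(dry, rir)
    # Normalize so the reverberant result does not clip.
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy example: a single click played through a synthetic "room" made of
# an exponential decay plus a few discrete reflections.
sr = 16000
dry = np.zeros(sr)
dry[0] = 1.0                                   # one click
rir = np.exp(-np.linspace(0.0, 8.0, sr // 2))  # decaying reverb tail
rir[::2000] += 0.3                             # a few distinct echoes
wet = apply_room_acoustics(dry, rir)
```

Where a real room’s impulse response is measured with microphones, Visual Acoustic Matching in effect infers the acoustic transformation from a photo of the target space.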

“Human listeners, without us even realizing it, are expecting to hear sounds in a certain way depending on the physical environment that we’re in,” said Kristen Grauman, research director at Meta AI. “That’s because audio is shaped by the environment we’re in.”

This could be useful for meetings with friends in the metaverse: when we don VR headsets we might be whisked away to a forest campsite to chat, but we don’t actually leave our living rooms or home offices. Recordings of our voices still carry the acoustics of the rooms we’re physically in, so the AI model can adjust that sound to match the gloaming-lit virtual forest and make it that much more immersive.

The next model does the opposite. Given knowledge of the environment, it removes the echoes created when sound bounces off surfaces, known as reverberation, to produce cleaner, crisper audio. The Visually Informed Dereverberation model could take a violinist’s performance recorded in a cavernous train station and make it sound as if it had been played in a studio.

The result is potentially better audio in general from headsets worn in homes and home offices, benefiting speech enhancement, speaker identification and speech recognition. With less echo sneaking into the audio, smart agents – and even people listening on the other end – would have an easier time understanding speech.
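Setting the learned model aside, the classical version of this problem is deconvolution: if the room’s impulse response were known, dividing it out in the frequency domain would undo the reverberation. A minimal sketch on a synthetic signal, using a regularized Wiener-style inverse (Meta’s model is notable precisely because it works blind, estimating the acoustics from an image rather than requiring a measured impulse response):

```python
import numpy as np

def dereverberate(wet: np.ndarray, rir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Regularized frequency-domain deconvolution (Wiener-style inverse)."""
    n = len(wet)
    W = np.fft.rfft(wet, n)
    H = np.fft.rfft(rir, n)
    # Dividing W by H undoes the convolution; eps keeps frequencies
    # where the room response is weak from blowing up the estimate.
    D = W * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(D, n)

# Round trip: reverberate a random "dry" signal, then recover it.
rng = np.random.default_rng(0)
dry = rng.standard_normal(4096)
rir = np.exp(-np.linspace(0.0, 6.0, 512))  # simple exponential decay
wet = np.convolve(dry, rir)                # reverberant signal
estimate = dereverberate(wet, rir)[: len(dry)]
```

In practice the impulse response is unknown and deconvolution alone is fragile, which is why learned, blind approaches like Meta’s are attractive.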

Finally, in the metaverse things will probably get a little noisy when lots of people are talking nearby, potentially over one another. VisualVoice takes a page from humans, who listen with more than just their ears – they also use their eyes, drawing clues from mouth movements and facial expressions.

The objective of VisualVoice is to disentangle individual voices from background noises and other voices that might be speaking at the same time and identify individual speakers. The result is that the AI model can provide better accessibility and potentially even create subtitles that attach to those speakers. It could even be used for smart agents to focus on and identify individuals in crowds.
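A common way such separation works under the hood is masking in the frequency domain: predict which frequency components belong to each speaker and suppress the rest. The sketch below fakes the hard part, applying a hand-made ideal mask to a mixture of two synthetic tones; in an audio-visual system like VisualVoice, a network guided by the speaker’s face would predict the mask for real, overlapping speech.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr

# Two "speakers", stood in for by tones at different pitches; their
# mixture is what a single microphone would record.
speaker_a = np.sin(2 * np.pi * 440.0 * t)
speaker_b = np.sin(2 * np.pi * 1200.0 * t)
mixture = speaker_a + speaker_b

# Mask-based separation: keep only the frequency bins believed to
# belong to speaker A, zero out everything else.
spectrum = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(len(mixture), d=1.0 / sr)
mask_a = freqs < 800.0
recovered_a = np.fft.irfft(spectrum * mask_a, len(mixture))
```

Real voices overlap heavily in frequency, so a fixed cutoff like this would fail; the visual cues are what let a learned model decide, moment to moment, which components belong to which face.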

With these new AI models, Meta hopes to supply superior audio to immersive AR and VR experiences in the future. Virtual reality is already providing profound experiences with visual representations of spaces, so it’s important that the quality of the sound keeps up with it.

Grauman sees a future where this AI audio research will provide truly unique experiences for people in the metaverse, such as visiting a concert.

“As soon as you put on your headset the sounds from your home would fade away and the audio would adjust realistically as you move from the hallway into the concert hall and closer to the stage,” she said. “And, if you wanted, AI could enhance the experience so that you could enjoy the experience and still hear your friend next to you.”

Image: Meta
