Meta AI builds speech recognition platform that uses visual cues to filter out background noise
Facebook parent company Meta Platforms Inc. is trying to tackle one of the biggest problems in artificial intelligence-based speech recognition: background noise.
Modern AI speech recognition systems don’t always work that well in situations where there’s lots of noise, or if multiple people are speaking at the same time. They generally use sophisticated noise-suppression techniques to try to filter out that noise, but those are often no match for the human ability to combine hearing with vision.
To solve the problem, Meta AI has created a new conversational AI framework called Audio-Visual Hidden Unit BERT, or AV-HuBERT, that aims to train AI models by both hearing and seeing people speak.
Meta AI said AV-HuBERT is the first system of its kind that can jointly model speech and lip movements from unlabeled video that hasn’t been transcribed.
It’s far superior to today’s speech recognition models, which use only audio as their input. Such models have to guess whether the speech they hear comes from one person or several, or whether certain sounds are just background noise. By adding vision into the mix, AV-HuBERT doesn’t need to guess, because it can work out who is talking from lip-movement cues.
“By combining visual cues, such as the movement of the lips and teeth during speaking, along with auditory information for representation learning, AV-HuBERT can capture nuanced associations between the two input streams efficiently even with much smaller amounts of untranscribed video data for pretraining,” Meta AI’s researchers explained. “Once the pretrained model learns the structure and correlation well, only a small amount of labeled data is needed to train a model for a particular task or a different language.”
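In rough terms, that pretraining recipe resembles a masked-prediction objective applied to two input streams at once: fuse per-frame audio and lip-region features, hide some frames, and train the model to predict discrete "hidden unit" targets for the hidden frames from the surrounding context. The sketch below is a minimal, illustrative PyTorch version of that idea, not Meta's released implementation; the layer sizes, the simple concatenation fusion and the cluster-target setup are assumptions made for brevity.

```python
# Minimal PyTorch sketch of HuBERT-style masked prediction over fused
# audio + lip-video features. Illustrative only: dimensions, the
# concatenation fusion and the cluster targets are assumptions, not
# Meta's released AV-HuBERT code.
import torch
import torch.nn as nn

class AudioVisualEncoder(nn.Module):
    def __init__(self, n_clusters=100, dim=256):
        super().__init__()
        # Audio branch: project frame-level acoustic features (B, T, 80)
        self.audio_proj = nn.Linear(80, dim)
        # Visual branch: project per-frame lip-region embeddings (B, T, 512)
        self.visual_proj = nn.Linear(512, dim)
        # Fuse the two streams by concatenation, then map back to model dim
        self.fuse = nn.Linear(2 * dim, dim)
        # Shared Transformer encoder over the fused frame sequence
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Predict a discrete "hidden unit" (cluster ID) for every frame
        self.classifier = nn.Linear(dim, n_clusters)

    def forward(self, audio, video, mask):
        a, v = self.audio_proj(audio), self.visual_proj(video)
        x = self.fuse(torch.cat([a, v], dim=-1))
        # Zero out masked frames so the model must infer them from context
        x = x.masked_fill(mask.unsqueeze(-1), 0.0)
        return self.classifier(self.encoder(x))

# Self-supervised step: predict cluster targets only at the masked positions,
# so no transcripts are needed, just frame-level cluster assignments.
model = AudioVisualEncoder()
audio = torch.randn(2, 50, 80)            # (batch, frames, audio features)
video = torch.randn(2, 50, 512)           # (batch, frames, lip-ROI features)
targets = torch.randint(0, 100, (2, 50))  # frame-level cluster IDs
mask = torch.rand(2, 50) < 0.3            # randomly mask ~30% of frames

logits = model(audio, video, mask)
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
loss.backward()
```

The masked-prediction loss is what lets the model learn the audio-visual structure from untranscribed video; only the later fine-tuning stage, which the researchers say needs a small amount of labeled data, requires transcripts.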
Meta AI says AV-HuBERT is extremely effective at this, delivering 75% more accuracy than the next best audio-visual speech recognition systems available. It also requires just 10% of the data other systems need to achieve those results.
That’s important, Meta AI said, because it is difficult to obtain large amounts of labeled audio-visual data for many of the world’s languages. It means AV-HuBERT can be used to build noise-robust speech recognition systems in far more languages than is possible with other frameworks.
“Since it requires far less supervised data for training, it will also open up possibilities for developing conversational AI models for hundreds of millions of people around the globe who don’t speak languages such as English, Mandarin, and Spanish, which have large-scale labeled datasets,” the researchers said.
Potential applications include digital assistants on smartphones that can understand what users are saying even in difficult conditions, such as at a packed football stadium or in a noisy factory. The approach could also help detect deepfakes by capturing the fine correlations between sounds and mouth movements. Another possibility is generating realistic lip movements for virtual reality avatars, delivering a stronger sense of presence.
Meta AI said it’s making the code and a batch of pre-trained AV-HuBERT models available to other researchers working in the domain, in the hope that the broader research community can build on its work and accelerate progress in audio-visual speech recognition.
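For researchers who want to try the released checkpoints, the code builds on Meta's open-source fairseq toolkit, so loading a pretrained model is likely to follow fairseq's standard pattern. The snippet below is only an illustrative sketch: the checkpoint path is a placeholder, and the exact steps may differ from the repository's own instructions.

```python
# Illustrative sketch of loading a pretrained checkpoint with fairseq's
# generic API. The checkpoint path is a placeholder, and the AV-HuBERT
# repository's custom task/model code must be importable (for example via
# fairseq's user-dir mechanism) before this will work.
from fairseq import checkpoint_utils

ckpt_path = "/path/to/av_hubert_pretrained.pt"  # placeholder path
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()  # pretrained encoder, ready for fine-tuning or feature extraction
```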