Meta AI open-sources tools for self-supervised training of speech recognition models
Meta Platforms Inc.’s artificial intelligence research team today said it has open-sourced a new project called Massively Multilingual Speech, which aims to overcome the challenges of creating accurate and reliable speech recognition models.
AI models that can recognize and respond to human speech have a lot of potential, especially for people who rely entirely on voice access to obtain information. However, training high-quality models generally requires enormous amounts of data: thousands of hours of audio, together with transcriptions of what's being said. For many languages, especially the more obscure ones, that data simply doesn't exist.
Meta's MMS project does away with that requirement by combining a self-supervised learning algorithm called wav2vec 2.0 with a new dataset that provides labeled data for more than 1,100 languages and unlabeled data for nearly 4,000 languages.
To overcome the lack of data for certain languages, Meta's researchers turned to the Bible, which, unlike most other books, has been translated into thousands of languages. Its translations are often studied in text-based machine translation research, and for many of them there are also publicly available audio recordings of people reading the text aloud.
“As part of this project, we created a dataset of readings of the New Testament in over 1,100 languages, which provided on average 32 hours of data per language,” Meta’s researchers said.
Of course, 32 hours of data is not enough to train a conventional supervised speech recognition model, which is why wav2vec 2.0 was used. Wav2vec 2.0 is a self-supervised learning algorithm: it learns the structure of speech from unlabeled audio by masking portions of the signal's latent representation and training the model to identify the true content among distractors, so no transcriptions are required.
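In rough terms, that objective looks like the toy sketch below. This is an illustrative reconstruction, not Meta's code: the dimensions, masking rate and negative-sampling scheme are placeholder choices, and the real model adds a convolutional feature encoder and quantized targets.

```python
# Toy sketch of wav2vec 2.0's masked contrastive objective (illustrative
# only; sizes, masking rate and sampling are placeholder choices).
import torch
import torch.nn.functional as F

batch, time, dim = 4, 100, 256
latents = torch.randn(batch, time, dim)   # stand-in for encoder features

# Mask ~15% of time steps; the context network never sees their content.
mask = torch.rand(batch, time) < 0.15
masked_input = latents.masked_fill(mask.unsqueeze(-1), 0.0)

# A small Transformer stands in for the much larger context network.
context_net = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
context = context_net(masked_input)

# At each masked step, the prediction should match the true latent more
# closely than randomly sampled distractors (an InfoNCE-style loss).
preds = context[mask]                      # (num_masked, dim)
targets = latents[mask]                    # true latents at masked steps
flat = latents.reshape(-1, dim)
neg_idx = torch.randint(0, flat.size(0), (preds.size(0), 10))
negatives = flat[neg_idx]                  # 10 distractors per masked step

pos = F.cosine_similarity(preds, targets, dim=-1).unsqueeze(1)
neg = F.cosine_similarity(preds.unsqueeze(1), negatives, dim=-1)
logits = torch.cat([pos, neg], dim=1) / 0.1   # temperature-scaled
loss = F.cross_entropy(logits, torch.zeros(len(preds), dtype=torch.long))
print(loss.item())
```

Because the targets come from the audio itself, this loss can be computed on recordings that have no transcript at all, which is what makes the approach viable for low-resource languages.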
With it, it's possible to train speech recognition models on far less labeled data. The MMS project trained multiple self-supervised models on around 500,000 hours of speech in more than 1,400 languages, before fine-tuning the resulting models for specific speech tasks such as multilingual speech recognition and language identification.
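Meta has since published the fine-tuned checkpoints through Hugging Face's transformers library. The sketch below shows minimal inference with one of them; the checkpoint name and adapter-loading calls reflect that integration and assume a recent transformers release, and the dummy waveform should be replaced with real 16 kHz audio.

```python
# Minimal MMS speech-recognition inference via Hugging Face transformers.
# Assumes a recent transformers release with MMS adapter support.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Pick the target language by its ISO 639-3 code; MMS swaps in a small
# per-language adapter and the matching vocabulary.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

audio = np.random.randn(16_000).astype("float32")  # 1 s of placeholder audio
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```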
Meta said the resulting models performed well both on standard benchmarks such as FLEURS and in direct comparisons with other speech recognition models.
“We trained multilingual speech recognition models on over 1,100 languages using a 1B parameter wav2vec 2.0 model,” Meta’s researchers explained. “As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4% but increases the language coverage by over 17 times.”
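Character error rate is the Levenshtein edit distance between a model's transcript and the reference, divided by the reference length; word error rate, cited in the Whisper comparison below, is the same computation over word tokens. A self-contained sketch (the helper names are ours, not Meta's):

```python
# Character error rate (CER): edit distance between hypothesis and
# reference transcripts, normalized by reference length.
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if chars match)
            )
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / len(reference)

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ≈ 0.091
```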
In a direct comparison with OpenAI LP’s Whisper speech recognition model, Meta’s researchers found that models trained on MMS data achieved approximately half the word error rate. “This demonstrates that our model can perform very well compared with the best current speech models,” the researchers said.
Meta said it's now sharing its MMS dataset and the tools used to train and refine its models so that others in the AI research community can build on the work. Meta's goals for MMS include expanding coverage to even more languages and improving its handling of dialects, which remain a major challenge for existing speech technology.
“Our goal is to make it easier for people to access information and to use devices in their preferred language,” the researchers said. “We also envision a future where a single model can solve several speech tasks for all languages. While we trained separate models for speech recognition, speech synthesis, and language identification, we believe that in the future, a single model will be able to accomplish all these tasks and more, leading to better overall performance.”
Image: Meta Platforms