UPDATED 12:00 EST / MARCH 31 2022

Meta advances textless natural language processing to generate more expressive AI speech

Meta Platforms Inc.’s artificial intelligence research team said today it has made significant progress in its effort to create more realistic AI-generated speech systems.

Its latest advances in what it calls “textless natural language processing” mean it’s now able to model expressive vocalizations, such as laughter, yawning and cries, in addition to “spontaneous chit-chat” in real time.

Meta’s work centers on so-called Generative Spoken Language Models, or GSLMs, a breakthrough class of natural language processing models that makes it possible to build speech recognition systems without using any transcribed audio data to train them.

In a blog post, Meta AI’s team explained that traditional AI systems remain quite limited in their ability to capture rich, expressive non-verbal signals in speech, such as intonations, emotional expressions, pauses, accents and rhythms, all of which can play a key role in human interactions. That’s because those systems can only learn from written text, which captures what people say but not how they say it.

Meta’s GSLMs are different, because they enable natural language processing models to capture the full expressive nature of oral language. It’s a powerful capability, and Meta said it has been training its GSLMs to exploit it, either to build downstream applications or as a generative tool for creating language from an audio prompt.

The result, Meta says, is that it can now model the expressive vocalizations that are essential to understanding the context of an interaction the way a person would. Such vocalizations allow AI systems to convey nuances of their communicative intent, Meta explained, or a sentiment such as boredom, irony or irritation.

In addition, Meta said it’s now able to model spontaneous real-time chit-chat between two AI agents in a highly realistic way. The agents can factor in behavior such as the occasional overlap or pause, “ums” and “ahs” and so on. Meta said this is an important development because it will allow virtual agents, for example, to recognize more nuanced social cues and signals. AI systems will also be able to interpret whether nonvocal expressions suggest positive or negative feedback, Meta said.

Meta explained that its intent is to create more natural and engaging AI speech systems. For instance, it’s planning to apply textless model training techniques to build more useful downstream applications, such as apps that can answer questions about the weather, without relying on resource-intensive text labels or automatic speech recognition systems.

“We believe prosody in speech can help better parse a sentence, which in turn facilitates understanding the intent and improves the performance of question answering,” Meta said.

Another potential use case is speech-to-speech translation, which might be useful for dubbing movies. Most AI dubbing systems work in a roundabout way: First, the audio is transcribed into text, then the text is translated, and finally the translation is converted back into audio. The process is complicated and strips out the expressivity of speech, because it misses idiomatic expressions unique to oral language. Meta said its GSLMs remove the need for text-based dubbing, meaning they could potentially produce far more realistic audio translations.

“Because self-supervised speech representation approaches are able to learn discrete units from raw audio, it’s now possible to remove the need for text and replace it with the pseudo text extracted from each of the target and source languages,” Meta’s AI team explained.
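
To make the idea of "pseudo text" concrete, here is a minimal sketch of how continuous audio features might be turned into a sequence of discrete units. The article does not describe Meta's actual procedure, so everything here is an assumption for illustration: the sketch uses simple nearest-centroid assignment, the 2-D feature frames and the three centroids are made-up toy values standing in for a learned speech encoder and a learned codebook, and the function names `quantize` and `collapse_repeats` are hypothetical.

```python
from math import dist


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: nearest-centroid assignment stands in for whatever
    unit-discovery method a real textless system uses.
    """
    return [
        min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units):
    """Merge consecutive duplicate unit IDs into a single symbol."""
    collapsed = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy values only: a real system would derive these from raw audio.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [
    (0.1, 0.0), (0.05, 0.1),   # close to centroid 0
    (0.9, 1.1), (1.0, 0.9),    # close to centroid 1
    (0.1, 0.95),               # close to centroid 2
    (0.0, 0.1),                # close to centroid 0 again
]

units = quantize(frames, centroids)
pseudo_text = collapse_repeats(units)

print(units)        # [0, 0, 1, 1, 2, 0]
print(pseudo_text)  # [0, 1, 2, 0]
```

The resulting list of unit IDs plays the role that a written transcript would play in a text-based pipeline: a language model can be trained on such sequences without any transcription step.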

As a final benefit, Meta said the advancement of textless natural language processing would help to make AI more inclusive. Traditional NLP applications need to be trained with enormous text resources, which means they’re available in only a handful of languages. By training such systems from oral speech alone, textless NLP will bring the benefits of AI speech to hundreds of languages that lack a standardized writing system, including Swiss German, dialectal Arabic and many more.

Image: Meta
