UPDATED 13:35 EDT / JUNE 16 2023

Meta introduces Voicebox, a generative AI model for speech

Researchers at the artificial intelligence labs of Meta Platforms Inc. today announced a breakthrough with a generative AI model for speech dubbed “Voicebox,” which can accomplish a wide variety of tasks such as synthesizing speech, styling and editing content.

According to researchers, what large language models such as OpenAI LP’s ChatGPT and diffusion models such as DALL-E did for text and images, Voicebox is now capable of doing for speech.

“Like generative systems for images and text, Voicebox creates outputs in a vast variety of styles, and it can create outputs from scratch as well as modify a sample it’s given,” the Meta AI researchers said in a blog post. “But instead of creating a picture or a passage of text, Voicebox produces high-quality audio clips.”

Voicebox is a broadly capable model that can synthesize speech in six languages without specialized training. It can also edit content, including removing interruptions, convert speech from one style to another and generate samples in diverse voices.

All the model needs to learn from is raw audio and its accompanying transcription. According to the researchers, other speech models cannot generalize across tasks and must be trained specifically for each one. That sets Voicebox apart, since it can perform multiple tasks without task-specific training.

To make Voicebox sound more “human,” researchers built the model on Flow Matching, a technique that lets the generative AI learn from varied speech data without needing the variations to be specifically labeled. That allows the AI to perform different tasks and lets it ingest training data at a larger scale.

“We trained Voicebox with more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese,” the researchers said. “Voicebox is trained to predict a speech segment when given the surrounding speech and the transcript of the segment.”
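The training setup the researchers describe, predicting a hidden speech segment from the surrounding speech and the transcript, can be illustrated with a small sketch. This is not Meta's code: the function name `make_infilling_example`, the use of plain floats as stand-ins for audio frames and the `None` placeholder for hidden frames are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical placeholder marking a frame the model is not allowed to see.
MASK = None


def make_infilling_example(
    frames: Sequence[float], span_start: int, span_len: int
) -> Tuple[List[Optional[float]], List[float]]:
    """Hide a contiguous span of frames and return (context, target).

    `context` is what a model would see (surrounding speech with the span
    masked out); `target` is the hidden span it would be asked to predict.
    The transcript would be supplied alongside, unchanged.
    """
    end = span_start + span_len
    if span_len <= 0 or span_start < 0 or end > len(frames):
        raise ValueError("span must lie inside the frame sequence")
    context = [MASK if span_start <= i < end else f for i, f in enumerate(frames)]
    target = list(frames[span_start:end])
    return context, target


if __name__ == "__main__":
    # Toy "audio": five frames, with frames 1 and 2 hidden from the model.
    context, target = make_infilling_example([0.1, 0.2, 0.3, 0.4, 0.5], 1, 2)
    print(context, target)
```

In a setup like this, the same masking step serves both training and editing: at inference time the masked span would simply be the part of a clip to regenerate.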

According to the research, Voicebox, using Flow Matching, outperforms Microsoft Corp.’s VALL-E model in intelligibility, with a word error rate of 1.9% versus VALL-E’s 5.9%, and in audio similarity, while running as much as 20 times faster.
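Word error rate, the intelligibility measure cited here, is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. Lower is better. The article does not say how Meta's evaluation was run, so the following is only a minimal sketch of the standard definition; the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six in the reference.
    print(word_error_rate("the dog barked at the door", "the dog barked at a door"))
```

For synthesized speech, the hypothesis would typically come from running a speech recognizer on the generated audio and the reference would be the text the model was asked to speak.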

Voicebox can use as little as two seconds of audio to match a sample’s style and use it for text-to-speech generation. Potential applications include speech for individuals who cannot speak, virtual assistants and voice acting in video games.

The model can also infill speech from context: when a clip is interrupted partway through, it predicts which words were likely spoken and how they should sound. As a result, it can seamlessly edit audio clips in which speech is interrupted by short-duration noise, such as a dog barking.

Having been trained on numerous voices, Voicebox can also simulate natural speech that is more representative of how people talk in the real world across the six languages it currently supports. That means it can be tuned to produce a variety of voices, tones and cadences, and even modify voice audio clips to match a different style or tone.

Although the researchers noted that this is an exciting breakthrough, they urged caution about its capabilities and its potential for misuse. As a result, the Voicebox model and its code are not being made available for public consumption.

“While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance” between openness and responsibility, the researchers said.

This concern is not without precedent, since voice simulation has existed for years and has been used for nefarious purposes before. Microsoft’s VALL-E model has similarly not been released to the public because of its capability of simulating people’s voices and thus creating a potential for misuse.

Right now, the information on Voicebox that Meta AI is sharing is in the form of the announcement, audio samples and a research paper detailing the results it has achieved.

Image: Racool_studio/Freepik
