UPDATED 09:30 EDT / AUGUST 02 2023

Meta releases AudioCraft, a generative AI framework for creating music and sound

Meta Platforms Inc.’s Fundamental Artificial Intelligence Research team said today it’s open-sourcing a new, high-quality generative AI framework focused on producing realistic sound and music from text-based inputs.

With the new AudioCraft, Meta’s FAIR team says, musicians will be able to explore new compositions without playing a single note on an instrument, while indie video game developers can populate virtual worlds with more realistic sound effects, despite operating on a shoestring budget.

The AudioCraft framework consists of three components: MusicGen, AudioGen and EnCodec. In a blog post, the FAIR team explained that MusicGen was trained on Meta-owned and specifically licensed music, and it can generate new music from text-based inputs. AudioGen, meanwhile, was trained on public sound effects and generates natural sounds from text inputs.

The EnCodec decoder is the secret sauce that makes it possible for those models to generate audio with fewer artifacts than previous audio-based generative AI models.

Meta’s FAIR team explains that there has been lots of excitement around generative AI models that create images, video and text, but very little has been heard about similar models that can generate sound, and that’s what it’s aiming to change.

The problem with audio generative AI models is that generating high-fidelity sound requires the modeling of complex signals and patterns at varying scales, the team said. It added that music is probably the most challenging type of audio to generate because it’s composed of both local and long-range patterns.

“Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls,” FAIR explained. “However, these approaches are unable to fully grasp the intricate timbres, expressive nuances, and stylistic performances found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.”

The key to addressing these challenges is EnCodec, which learns discrete audio tokens from raw signals, creating a kind of “fixed vocabulary” for music samples, FAIR explained. These discrete audio tokens are then used to train autoregressive language models, which generate new sounds and music as sequences of tokens that EnCodec’s decoder converts back into audio.
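
To make that tokenize-then-decode loop concrete, here’s a minimal sketch using the standalone encodec Python package Meta published alongside the EnCodec research. The file path is a placeholder, and the exact interface may vary between releases:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bandwidth;
# higher bandwidth means more codebooks and better fidelity.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load a clip (placeholder path) and match the model's rate and channels.
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    # Encode the raw waveform into discrete tokens -- the "fixed vocabulary."
    encoded_frames = model.encode(wav)
    codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)
    print(codes.shape)  # [batch, n_codebooks, timesteps]

    # Autoregressive language models are trained over token sequences like
    # these; EnCodec's decoder maps a token sequence back to a waveform.
    reconstructed = model.decode(encoded_frames)
```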

According to FAIR, the AudioCraft framework vastly simplifies the design of audio generative AI models compared to earlier efforts, giving users the full recipe to experiment with its AudioGen and MusicGen models or even develop their own from scratch.

AudioGen was a lot easier to build, the team said, and can generate realistic environmental sounds from a text-based description of that sound. MusicGen is more complex, but can still generate coherent and novel musical pieces, the team said. It was trained on about 400,000 recordings together with their text descriptions and metadata, amounting to about 20,000 hours of music in total.
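
For anyone who wants to try the models, the open-source audiocraft package exposes a simple Python interface. The following is a minimal sketch based on the project’s published examples; the checkpoint name and prompt here are illustrative:

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Download a pretrained MusicGen checkpoint; "small" is the lightest
# variant, while "medium" and "large" trade speed for quality.
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per clip

# One text prompt per clip to generate (prompt text is illustrative).
descriptions = ["upbeat acoustic folk with hand claps and whistling"]
wav = model.generate(descriptions)  # tensor: [batch, channels, samples]

# Write each clip to disk with loudness normalization.
for idx, clip in enumerate(wav):
    audio_write(f"clip_{idx}", clip.cpu(), model.sample_rate, strategy="loudness")
```

AudioGen follows the same pattern: swap in audiocraft.models.AudioGen with a pretrained checkpoint such as facebook/audiogen-medium and describe a sound effect rather than a piece of music.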

Constellation Research Inc. Vice President and Principal Analyst Andy Thurai said Meta has a strong pedigree in generative AI thanks to the release of its Llama 2 model for text generation earlier this year. But whereas that model was about chat and conversational AI, AudioCraft is all about generating sounds. He explained that MusicGen can produce short, basic pieces of music based on a textual description.

“While that may sound limited, some of the output I tested sounds as real as if it were produced by humans,” he said. “Even more impressive is that it was trained using Meta-owned and specifically licensed music, which means there are no copyright or IP infringement issues.”

It’s a similar story for AudioGen, the analyst continued, as it was trained on publicly available sound effects to create a wide range of sounds from textual descriptions.

According to Thurai, both models are likely to be useful. For instance, MusicGen can help someone without any musical knowledge compose a basic soundtrack for commercial use, while marketing teams might use AudioGen to create appropriate sound effects for commercials, he said.

“Obviously, this cuts into people who do that for a living,” Thurai stated. “The small-time musician, the sound production engineer, or the recording artist who is currently making a living by producing original tracks on demand for customers. These AI tools will be able to create original tracks that sound and feel real, just as if they were created by humans.”

FAIR said it’s making the AudioCraft framework, including AudioGen, MusicGen and EnCodec, available to the wider AI community under an open-source license.

“The models are available for research purposes and to further people’s understanding of the technology,” FAIR said. “We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art.”

Photo: Elviss Railijs Bitāns/Pexels
