Meta releases AudioCraft, a generative AI model for creating music and sound
Meta Platforms Inc.’s Fundamental Artificial Intelligence Research team said today it’s open-sourcing a new, high-quality generative AI framework focused on producing realistic sound and music from text-based inputs.
With the new AudioCraft, Meta’s FAIR team says, musicians will be able to explore new compositions without playing a single note on an instrument, while indie video game developers can populate virtual worlds with more realistic sound effects, despite operating on a shoestring budget.
The AudioCraft framework consists of three components: MusicGen, AudioGen and EnCodec. In a blog post, the FAIR team explained that MusicGen, which was trained on Meta-owned and specifically licensed music, generates new music from text-based inputs. AudioGen, meanwhile, was trained on public sound effects and generates natural sounds from text inputs.
The EnCodec neural codec and its decoder are the secret sauce that allows those models to produce audio with fewer artifacts than previous audio-based generative AI models.
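For readers who want a sense of how the text-to-music piece is used in practice, here is a minimal sketch built on the open-source audiocraft Python package. The checkpoint name, prompt and output file name are illustrative, and the exact interface may differ between releases, so treat this as an assumption-laden example rather than a definitive recipe.

```python
# Minimal MusicGen sketch (assumes the open-source "audiocraft" package is installed).
# The checkpoint name, prompt and output file name are illustrative placeholders.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")  # small pretrained checkpoint
model.set_generation_params(duration=8)                     # generate roughly 8 seconds

# A text description is the only input; no instrument or MIDI data is required.
wav = model.generate(["upbeat acoustic folk with hand claps and gentle strings"])

for idx, one_wav in enumerate(wav):
    # Writes sample_0.wav at the model's native sample rate with loudness normalization.
    audio_write(f"sample_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```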
Meta’s FAIR team explains that there has been lots of excitement around generative AI models that create images, video and text, but very little has been heard about similar models that can generate sound, and that’s what it’s aiming to change.
The problem with audio generative AI models is that generating high-fidelity sound requires the modeling of complex signals and patterns at varying scales, the team said. It added that music is probably the most challenging type of audio to generate because it’s composed of both local and long-range patterns.
“Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls,” FAIR explained. “However, these approaches are unable to fully grasp the intricate timbres, expressive nuances, and stylistic performances found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.”
The key to addressing these challenges is EnCodec, which learns discrete audio tokens from the raw signal, effectively creating a kind of “fixed vocabulary” for music samples, FAIR explained. Those discrete tokens are then used to train autoregressive language models, whose newly generated token sequences are converted back into sound and music by EnCodec’s decoder.
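To make the idea of discrete audio tokens concrete, the sketch below uses Meta’s standalone encodec package to compress a waveform into token codes and decode it back. The file name is a placeholder, and this only illustrates the tokenization step, not the full AudioCraft training recipe.

```python
# Sketch of EnCodec's encode/decode round trip (assumes the "encodec" and "torchaudio"
# packages are installed). The input file name is a placeholder.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()  # pretrained 24 kHz codec
model.set_target_bandwidth(6.0)             # higher bandwidth: more tokens, better fidelity

wav, sr = torchaudio.load("example.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)                         # list of (codes, scale) frames
    codes = torch.cat([f[0] for f in encoded_frames], dim=-1)  # [batch, n_codebooks, time]
    reconstructed = model.decode(encoded_frames)               # back to a waveform

print(codes.shape)  # the discrete "vocabulary" a language model can be trained to predict
```

In AudioCraft, it is these token sequences, rather than raw waveforms, that the autoregressive models learn to predict.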
According to FAIR, the AudioCraft framework vastly simplifies the design of audio generative AI models compared to earlier efforts, giving users the full recipe to experiment with its AudioGen and MusicGen models or even develop their own from scratch.
AudioGen was a lot easier to build, the team said, and can generate realistic environmental sounds from a text-based description of that sound. MusicGen is more complex, but can still generate coherent and novel musical pieces, the team said. It was trained on about 400,000 recordings together with their text descriptions and metadata, amounting to about 20,000 hours of music in total.
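AudioGen is exposed through the same open-source package with a near-identical interface. The sketch below, with an assumed checkpoint name and an illustrative prompt, shows how an environmental sound might be generated from a one-line description.

```python
# Minimal AudioGen sketch (assumes the open-source "audiocraft" package; the checkpoint
# name and prompt are illustrative placeholders).
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # about five seconds of sound

wav = model.generate(["dog barking in the distance while rain falls on a tin roof"])
audio_write("barking_dog", wav[0].cpu(), model.sample_rate, strategy="loudness")
```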
Constellation Research Inc. Vice President and Principal Analyst Andy Thurai said Meta has a strong pedigree in generative AI thanks to the release of its Llama 2 model for text generation earlier this year. But whereas that model is about chat and conversational AI, AudioCraft is all about generating sounds. He explained that MusicGen can produce basic, limited-length music from a textual description.
“While that may sound limited, some of the output I tested sounds as real as if it were produced by humans,” he said. “Even more impressive is that it was trained using Meta-owned and specifically licensed music, which means there are no copyright or IP infringement issues.”
It’s a similar story for AudioGen, the analyst continued: It was trained on publicly available sound effects to create a wide range of sounds from textual descriptions.
According to Thurai, both models are likely to be useful. For instance, MusicGen can help someone without any musical knowledge to compose a basic soundtrack for commercial use. Moreover, marketing teams might use it to create appropriate sound effects for commercials, he said.
“Obviously, this cuts into people who do that for a living,” Thurai stated. “The small-time musician, the sound production engineer, or the recording artist who is currently making a living by producing original tracks on demand for customers. These AI tools will be able to create original tracks that sound and feel real, just as if they were created by humans.”
FAIR said it’s making the AudioCraft framework, including AudioGen, MusicGen and EnCodec, available to the wider AI community under an open-source license.
“The models are available for research purposes and to further people’s understanding of the technology,” FAIR said. “We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art.”
Photo: Elviss Railijs Bitāns/Pexels