

Stability AI Ltd. today introduced a new iteration of Stable Audio, its artificial intelligence system for generating sound clips, that offers a significantly expanded feature set.
The original version of the AI made its debut last September. Stable Audio 1.0, as the first-generation model is known, can generate audio files up to 90 seconds in length. The Stable Audio 2.0 model that Stability AI launched today can generate tracks up to twice as long and gives users more ways to customize the output.
The previous iteration of the system generates audio based on text prompts. Stable Audio 2.0, meanwhile, can ingest not only text but also existing sound clips supplied by the user. The AI can match the style of the audio it generates to those clips, which enables customers to more precisely align the output files with their requirements.
Stable Audio 2.0 also introduces other enhancements. Stability AI says that the model can generate “structured compositions that include an intro, development, and outro.” Another improvement over the previous-generation system is that Stable Audio 2.0 can generate sound effects.
The new capabilities are the result of a major upgrade to the underlying AI architecture.
Like its predecessor, Stable Audio 2.0 is based on a so-called diffusion model design. Diffusion models are neural networks widely used for generating media files. What sets them apart from other AI algorithms is the way they’re trained: During development, they receive a collection of sound clips to which noise has been added and are tasked with removing that noise to restore the original audio.
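Conceptually, one training step works something like the sketch below. This is an illustrative PyTorch example, not Stability AI’s actual code; the toy network, sample rate and tensor shapes are assumptions made for clarity.

```python
import torch
import torch.nn as nn

# A toy denoiser standing in for the real audio network (hypothetical architecture).
denoiser = nn.Sequential(nn.Linear(16_000, 512), nn.ReLU(), nn.Linear(512, 16_000))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

clean_audio = torch.randn(8, 16_000)       # a batch of 1-second clips at 16 kHz (assumed)
noise = torch.randn_like(clean_audio)      # random noise used to corrupt the clips
noise_level = torch.rand(8, 1)             # how strongly each clip is corrupted
noisy_audio = clean_audio + noise_level * noise

# The network is trained to predict the noise that was added, which is
# equivalent to learning how to restore the original audio from the corrupted clip.
predicted_noise = denoiser(noisy_audio)
loss = nn.functional.mse_loss(predicted_noise, noise)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Repeating this step over many clips and noise levels teaches the network to turn pure noise into plausible audio at generation time.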
Stable Audio 2.0 uses a specialized implementation of the technology known as a latent diffusion model. Like other neural networks, such models are trained on a dataset similar to the files they’ll process in production. But before training begins, the dataset is transformed into a mathematical structure called a latent space that makes the AI development process more efficient.
A latent space contains only the most important details from the dataset on which it’s based. Less relevant details are removed, which reduces the total amount of information that AI models have to process during training. This decrease in data volumes cuts the amount of hardware necessary for AI training, which in turn lowers costs.
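The savings can be sketched as follows: instead of denoising raw waveforms, the audio is first compressed into a much smaller latent representation and the diffusion training runs on that. The encoder, dimensions and shapes here are illustrative assumptions, not details of Stability AI’s model.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(16_000, 256)  # stands in for a pretrained autoencoder's encoder (hypothetical)
latent_denoiser = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

clean_audio = torch.randn(8, 16_000)
with torch.no_grad():
    latents = encoder(clean_audio)  # 16,000 raw samples -> 256 latent values per clip

noise = torch.randn_like(latents)
noisy_latents = latents + noise

# Training now operates on 256-dimensional latents instead of 16,000 raw samples,
# which is why the latent-space approach needs far less compute and hardware.
loss = nn.functional.mse_loss(latent_denoiser(noisy_latents), noise)
```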
The first iteration of Stable Audio was also based on a latent diffusion model. The new version released today features a more efficient mechanism for generating latent spaces. “It captures and reproduces the essential features while filtering out less important details for more coherent generations,” the company detailed in a blog post.
Stability AI’s engineers also added a new neural network based on the Transformer architecture. Developed by Google LLC in 2017, the architecture is primarily used to build language models. A Transformer can take into account a large amount of contextual information when interpreting a piece of data, which enables it to produce more accurate results than earlier neural networks.
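The key mechanism is self-attention, which lets every element of a sequence draw on every other element at once. The short sketch below shows the idea with a standard PyTorch Transformer layer; the sequence length and dimensions are assumptions for illustration, not the actual Stable Audio 2.0 configuration.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

# 8 clips, each represented as 512 latent "tokens" of 256 values (assumed shapes).
latent_tokens = torch.randn(8, 512, 256)

# Self-attention lets each token attend to the entire sequence, which is how the
# model can track large-scale structure such as an intro, development and outro
# across a long track.
contextualized = layer(latent_tokens)
print(contextualized.shape)  # torch.Size([8, 512, 256])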
“The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions,” Stability AI detailed.
Stable Audio 2.0 is available at no charge to consumers via a website that the company has created for the model. It’s set to become accessible through an application programming interface “soon.” The API will allow other companies to integrate Stable Audio 2.0 into their applications.