Stability AI Unveils “Stable Audio”: A Latent Diffusion Model for Audio Generation
Stability AI has introduced “Stable Audio,” a latent diffusion model for audio generation. The company presents it as another significant step forward for generative AI: users can control both the content and the length of the generated audio, including complete songs.
Overcoming Limitations in Traditional Audio Diffusion Models
Historically, audio diffusion models have been limited to generating audio of fixed durations, leading to abrupt and incomplete musical phrases. This issue stems from the models being trained on randomly cropped audio chunks taken from longer files and then forced into predetermined lengths.
Stable Audio: Generation with Specified Lengths
Stable Audio effectively addresses this challenge, enabling the generation of audio with specified lengths, up to the training window size. The model’s unique approach significantly reduces inference times by using a heavily downsampled latent representation of audio.
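To give a rough sense of why a downsampled latent speeds things up, the sketch below compares the number of raw audio samples in a training window with the number of latent frames the diffusion model would process. The 44.1 kHz sample rate, the 95-second window, and the downsampling factor of 1024 are illustrative assumptions, not figures taken from this article, and `latent_frames` is a hypothetical helper.

```python
import math

# Assumed values for illustration only; the article does not state them.
SAMPLE_RATE = 44_100        # samples per second, per channel
DOWNSAMPLE_FACTOR = 1024    # raw samples represented by one latent frame
WINDOW_SECONDS = 95         # assumed training window size

def latent_frames(seconds, sample_rate=SAMPLE_RATE, factor=DOWNSAMPLE_FACTOR):
    """Number of latent frames needed to cover `seconds` of audio."""
    samples = int(round(seconds * sample_rate))
    return math.ceil(samples / factor)

raw_samples = WINDOW_SECONDS * SAMPLE_RATE
frames = latent_frames(WINDOW_SECONDS)
print(f"{raw_samples} raw samples per channel -> {frames} latent frames")
```

Under these assumptions the diffusion model operates on a sequence roughly a thousand times shorter than the raw waveform, which is where the reduction in inference time would come from.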
Core Architecture: VAE, Text Encoder, and Conditioned Diffusion Model
The core architecture of Stable Audio consists of a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model. The VAE plays a pivotal role by compressing stereo audio into a noise-resistant, lossy latent encoding that significantly expedites both generation and training processes.
Text Conditioning: Leveraging Text Features
To condition generation on text prompts, Stability AI uses the text encoder of a CLAP (Contrastive Language-Audio Pretraining) model trained on its own dataset. Because CLAP learns from paired text and audio, its text features carry information about the relationships between words and sounds.
Conditioning for Desired Audio Lengths
During training, the model learns to incorporate two key properties from audio chunks: the starting second (“seconds_start”) and the total duration of the original audio file (“seconds_total”). These properties are transformed into discrete learned embeddings per second, which are then concatenated with the text prompt tokens. This unique conditioning allows users to specify the desired length of the generated audio during inference.
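The timing conditioning described above can be sketched in a few lines. This is a minimal illustration, not Stability AI's implementation: the table size, embedding width, and the helper names `make_table`, `random_crop_timing`, and `build_conditioning` are all hypothetical, and random vectors stand in for learned embeddings and text encoder outputs.

```python
import random

MAX_SECONDS = 512   # assumed size of each per-second embedding table
EMBED_DIM = 8       # assumed embedding width, kept tiny for readability

def make_table(num_entries, dim, rng):
    """Stand-in for a learned embedding table: one vector per whole second."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_entries)]

def random_crop_timing(file_seconds, window_seconds, rng):
    """Choose a random training chunk; return (seconds_start, seconds_total)."""
    latest_start = max(0, int(file_seconds) - window_seconds)
    seconds_start = rng.randint(0, latest_start)
    return seconds_start, int(file_seconds)

def build_conditioning(text_tokens, seconds_start, seconds_total,
                       start_table, total_table):
    """Concatenate the two timing embeddings with the text token features."""
    last = len(start_table) - 1
    start_vec = start_table[min(max(seconds_start, 0), last)]
    total_vec = total_table[min(max(seconds_total, 0), last)]
    return text_tokens + [start_vec, total_vec]

rng = random.Random(0)
start_table = make_table(MAX_SECONDS, EMBED_DIM, rng)
total_table = make_table(MAX_SECONDS, EMBED_DIM, rng)

# Five placeholder text token features (a real text encoder would produce these).
text_tokens = [[0.0] * EMBED_DIM for _ in range(5)]

# At inference the user picks the length: start at 0, total of 30 seconds.
conditioning = build_conditioning(text_tokens, 0, 30, start_table, total_table)
```

During training the pair would come from `random_crop_timing`; at inference the user supplies it directly, which is how the desired length gets specified.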
Advanced Diffusion Model: 907 Million Parameters
The diffusion model at the heart of Stable Audio has 907 million parameters. It combines residual layers, self-attention layers, and cross-attention layers to denoise the input while taking the text and timing embeddings into account. Memory-efficient attention implementations let the model scale to longer sequences.
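As a rough illustration of how cross-attention lets the denoiser take conditioning into account, the sketch below implements plain scaled dot-product attention in which latent frames act as queries and conditioning tokens act as keys and values. It is a simplified assumption-laden sketch: the learned projection matrices, multiple heads, and the memory-efficient kernels mentioned above are all omitted, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (latent frame) takes a weighted average of the values
    (conditioning tokens), weighted by scaled query-key similarity."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    output = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(width)])
    return output

# Two latent frames attending over three conditioning tokens
# (for example, text tokens plus timing embeddings).
latents = [[1.0, 0.0], [0.0, 1.0]]
conditioning = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(latents, conditioning, conditioning)
```

Each output row is a mixture of conditioning vectors, so every latent frame can draw on the prompt and timing information when it is denoised.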
Training the Flagship Model: Over 800,000 Audio Files
To train the flagship Stable Audio model, Stability AI curated a dataset of over 800,000 audio files spanning music, sound effects, and single-instrument stems. The dataset, provided in partnership with a prominent stock music provider, amounts to 19,500 hours of audio.
Future Developments and Upcoming Releases
Stable Audio represents the cutting edge of audio generation research from Harmonai, Stability AI’s generative audio research lab. The team remains dedicated to advancing model architectures, refining datasets, and enhancing training procedures, aiming for higher output quality, greater controllability, faster inference speeds, and longer achievable output lengths.
Stability AI’s Recent Milestones
Stability AI recently joined a group of AI companies in signing the White House’s voluntary AI safety commitments, as part of the second round of signatories.