SD AI LG MM AS

May 13, 2025

Fast Text-to-Audio Generation with Adversarial Post-Training

Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons

arXiv:2505.08175v3 · 15 citations · h-index: 15 · WASPAA

Originality: Highly original

AI Analysis

This work addresses latency issues for creative applications by providing a fast text-to-audio generation method, though it is incremental in that it builds on existing diffusion/flow models and adds adversarial post-training.

The paper tackles the slow inference time of text-to-audio systems by introducing Adversarial Relativistic-Contrastive (ARC) post-training, an adversarial acceleration algorithm for diffusion/flow models that does not rely on distillation. The resulting model generates approximately 12 seconds of 44.1kHz stereo audio in about 75ms on an H100, and in about 7 seconds on a mobile edge-device, making it the fastest text-to-audio model to the authors' knowledge.

Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating $\approx$12s of 44.1kHz stereo audio in $\approx$75ms on an H100, and $\approx$7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.
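To put the reported latencies in perspective, the short sketch below converts them into a real-time factor (seconds of audio produced per second of compute). The function and variable names are illustrative only and do not come from the paper, and the mobile figure assumes the $\approx$7s latency refers to generating the same $\approx$12s clip.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock compute."""
    return audio_seconds / wall_clock_seconds


# Approximate figures quoted in the abstract.
AUDIO_SECONDS = 12.0      # length of the generated clip
SAMPLE_RATE_HZ = 44_100   # 44.1kHz
CHANNELS = 2              # stereo
H100_LATENCY_S = 0.075    # ~75ms on an H100
MOBILE_LATENCY_S = 7.0    # ~7s on a mobile edge-device (assumed: same 12s clip)

# Raw number of audio samples in one generated clip.
total_samples = int(AUDIO_SECONDS * SAMPLE_RATE_HZ * CHANNELS)

h100_rtf = real_time_factor(AUDIO_SECONDS, H100_LATENCY_S)
mobile_rtf = real_time_factor(AUDIO_SECONDS, MOBILE_LATENCY_S)

print(f"Samples per clip: {total_samples}")
print(f"H100:   ~{h100_rtf:.0f}x faster than real time")
print(f"Mobile: ~{mobile_rtf:.1f}x faster than real time")
```

Under these assumptions, the H100 figure is roughly two orders of magnitude faster than real time, while the mobile figure remains somewhat faster than real time.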

