SD · LG · AS · May 16

Taming Audio VAEs via Target-KL Regularization

arXiv:2605.17085 · 71.7
Predicted impact: top 25% in SD · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in audio generation, this provides a principled method to balance VAE regularization and latent predictability, improving latent diffusion model performance.

The paper proposes target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion analysis and identifying optimal compression for text-to-sound generation.

Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high-dimensional signals into a low-frame-rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized latent representations (poor output quality) and under-regularized ones (difficult to predict). We propose a framework for studying this trade-off through compression, and train audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison with well-studied discrete neural audio codec models and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates helps identify the optimal generation setting.
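To make the link between a KL term and a bitrate concrete, the following is a minimal sketch, not the paper's implementation. It assumes a diagonal-Gaussian posterior with a standard-normal prior and treats the per-frame KL (in nats) as the rate. The function names, the absolute-deviation penalty in `target_kl_penalty`, and the example numbers (2 kbps, 25 latent frames per second) are all hypothetical choices made for illustration; the abstract does not specify the exact loss.

```python
import math


def gaussian_kl_nats(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dims, in nats."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, logvar))


def kl_to_bitrate(kl_nats_per_frame, frame_rate_hz):
    """Convert a per-frame KL in nats to bits per second."""
    return kl_nats_per_frame / math.log(2) * frame_rate_hz


def bitrate_to_target_kl(bits_per_second, frame_rate_hz):
    """Inverse of kl_to_bitrate: per-frame KL target (nats) for a bitrate."""
    return bits_per_second * math.log(2) / frame_rate_hz


def target_kl_penalty(kl_nats_per_frame, target_kl_nats):
    """Hypothetical penalty: absolute deviation of the KL from its target.

    Assumption for illustration only; the paper's regularizer may differ.
    """
    return abs(kl_nats_per_frame - target_kl_nats)


# Example (made-up numbers): a 2 kbps target at 25 latent frames per second.
target = bitrate_to_target_kl(2000.0, 25.0)
kl = gaussian_kl_nats([0.5, -1.0], [0.0, -0.5])
penalty = target_kl_penalty(kl, target)
```

Under these assumptions, sweeping the target bitrate and recording reconstruction error at each setting would trace out a rate-distortion curve of the kind the abstract describes.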

