SD · AI · AS · Aug 16, 2024

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

arXiv:2408.09027v2 · 13 citations · h-index: 56 · Has Code
AI Analysis

This work addresses the computational inefficiency of autoregressive models in audio generation, which is crucial for integration into large language models, though it is incremental in nature.

The paper tackles the efficiency problem in autoregressive audio generation by proposing a scale-level tokenizer and modeling framework, achieving 35x faster inference speed and a 1.33 improvement in Fréchet Audio Distance on AudioSet.

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, because audio sequences are inherently long, the efficiency of audio generation remains an essential issue, especially for AR models that are incorporated into large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT) with improved residual quantization. Based on SAT, we further propose a scale-level Acoustic AutoRegressive (AAR) modeling framework, which shifts next-token AR prediction to next-scale AR prediction, significantly reducing training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate that the proposed AAR framework achieves a remarkable 35× faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR
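To make the difference between next-token and next-scale prediction concrete, the following is a minimal sketch that only counts sequential decoding steps. The scale schedule `scale_lengths` and both helper functions are hypothetical illustrations and are not taken from the paper or its repository; the real SAT/AAR scale configuration may differ, and step counts are not the same thing as measured wall-clock speedup.

```python
# Illustrative only: compares how many sequential forward passes each
# decoding scheme needs. All names and numbers here are assumptions.

def next_token_steps(num_tokens: int) -> int:
    """Next-token AR: one sequential forward pass per generated token."""
    return num_tokens


def next_scale_steps(scale_lengths: list[int]) -> int:
    """Next-scale AR: one sequential forward pass per scale.

    All tokens belonging to a scale are predicted in parallel,
    conditioned on the coarser scales generated before it.
    """
    return len(scale_lengths)


# Hypothetical coarse-to-fine schedule: number of tokens at each scale.
scale_lengths = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

# A flat tokenizer at the finest resolution would emit this many tokens.
flat_tokens = scale_lengths[-1]

token_steps = next_token_steps(flat_tokens)
scale_steps = next_scale_steps(scale_lengths)
total_scale_tokens = sum(scale_lengths)

print(f"next-token sequential steps: {token_steps}")
print(f"next-scale sequential steps: {scale_steps}")
print(f"total tokens across scales:  {total_scale_tokens}")
```

Under this assumed schedule the next-scale scheme produces more tokens in total, but needs far fewer sequential passes because each scale is decoded in parallel; that trade-off is the intuition behind the reported reduction in inference time.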

Code Implementations: 1 repo
