AS, AI, CL, LG, SD
Jun 8, 2024

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

arXiv:2406.05551v1
42 citations
Originality: Highly original
AI Analysis

This paper addresses the limited generative capability of audio language models in text-to-speech synthesis. It offers a novel approach that improves quality and reduces latency, while building incrementally on existing diffusion and transformer methods.

The paper tackles the limitations of low-bitrate audio tokenization in text-to-speech synthesis by proposing an autoregressive diffusion transformer (ARDiT) that represents audio as sequences of continuous vectors and generates them autoregressively. ARDiT achieves performance comparable to or surpassing state-of-the-art models, and one of its variants generates 170 ms of 24 kHz speech per evaluation step with minimal degradation.
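To put the throughput figure in concrete terms, the arithmetic can be sketched as follows. The helper names are hypothetical, and the calculation assumes that "170 ms per step" refers to the duration of audio emitted by one model evaluation.

```python
def samples_per_step(duration_ms: int, sample_rate_hz: int) -> int:
    """Waveform samples covered by one generation step."""
    return round(duration_ms * sample_rate_hz / 1000)


def steps_per_second_of_audio(duration_ms: int) -> float:
    """Model evaluations needed to produce one second of audio."""
    return 1000 / duration_ms


# 170 ms of 24 kHz audio per step (figures from the summary above).
print(samples_per_step(170, 24_000))            # 4080
print(round(steps_per_second_of_audio(170), 2)) # 5.88
```

Under that assumption, each step covers 4080 waveform samples, so roughly six evaluations are needed per second of generated speech.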

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb R^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. Simultaneously, it condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at http://ardit-tts.github.io/ .
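The latency argument in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the learned network, IKL distillation is not implemented, and the `denoise_steps=1` setting only mimics the call count of a one-step sampler. The sketch counts how many model evaluations are needed when vectors are generated one at a time with iterative denoising versus several per step with a single denoising pass.

```python
import random
from typing import List, Tuple

Vector = List[float]


def toy_denoiser(context: List[Vector], block: List[Vector],
                 noise_level: float) -> List[Vector]:
    """Hypothetical stand-in for a learned denoiser (NOT the paper's network).

    Shrinks each noisy vector toward the last context vector (or the origin
    when there is no context), keeping a `noise_level` fraction of the residual.
    """
    dim = len(block[0])
    target = context[-1] if context else [0.0] * dim
    return [[t + noise_level * (x - t) for x, t in zip(vec, target)]
            for vec in block]


def generate(total_vectors: int, block_size: int, dim: int,
             denoise_steps: int, seed: int = 0) -> Tuple[List[Vector], int]:
    """Block-wise autoregressive sampling; returns (sequence, model evaluations)."""
    rng = random.Random(seed)
    sequence: List[Vector] = []
    evaluations = 0
    while len(sequence) < total_vectors:
        size = min(block_size, total_vectors - len(sequence))
        # Each new block starts from Gaussian noise in R^dim.
        block = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
        for step in range(denoise_steps):
            noise_level = 1.0 - (step + 1) / (denoise_steps + 1)
            # The denoiser is conditioned on everything generated so far.
            block = toy_denoiser(sequence, block, noise_level)
            evaluations += 1
        sequence.extend(block)
    return sequence, evaluations


# One vector per step with 8 denoising passes vs. four vectors per single pass.
_, calls_iterative = generate(12, block_size=1, dim=2, denoise_steps=8)
_, calls_blockwise = generate(12, block_size=4, dim=2, denoise_steps=1)
print(calls_iterative, calls_blockwise)  # 96 3
```

The numbers here (12 vectors, 8 passes, blocks of 4) are arbitrary illustration values. The point is only that evaluations scale with the number of blocks times the denoising passes per block, which is why both one-step distillation and multi-vector prediction reduce sampling latency.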
