AS, AI, SD | Oct 29, 2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

arXiv:2410.21951v2 | 20 citations | h-index: 13 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for text-to-speech applications, representing an incremental improvement in a domain-specific area.

The paper tackles slow inference in auto-regressive text-to-speech systems by introducing VADUSA, which applies speculative decoding to accelerate synthesis. The authors report faster inference together with improved performance, and no loss in quality.

The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
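To make the abstract's "tolerance mechanism" concrete, here is a minimal Python sketch of the verification step in speculative decoding. It is not the authors' implementation. It assumes that tolerance means "accept a drafted token if it lies within the base model's top-k candidates at that position", and it checks a single draft sequence rather than a tree of candidates. The names `top_k_ids`, `accept_drafts`, and `tolerance` are hypothetical, and the probabilities are toy values.

```python
from typing import List, Sequence


def top_k_ids(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k most probable tokens (ties broken by lower index)."""
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]


def accept_drafts(
    draft_tokens: Sequence[int],
    verifier_probs: Sequence[Sequence[float]],
    tolerance: int = 1,
) -> List[int]:
    """Return the longest prefix of draft_tokens that the base model accepts.

    verifier_probs[i] is the base model's distribution at the position where
    draft_tokens[i] was proposed. tolerance=1 is strict (the draft must equal
    the base model's top choice); larger values relax the check.
    This acceptance rule is an assumption made for illustration.
    """
    accepted: List[int] = []
    for token, probs in zip(draft_tokens, verifier_probs):
        if token not in top_k_ids(probs, tolerance):
            break  # first rejection ends the accepted prefix
        accepted.append(token)
    return accepted


# Toy example: three drafted speech tokens over a vocabulary of size 3.
probs = [
    [0.1, 0.7, 0.2],  # base model prefers token 1, then token 2
    [0.5, 0.1, 0.4],  # base model prefers token 0, then token 2
    [0.2, 0.3, 0.5],  # base model prefers token 2, then token 1
]
drafts = [1, 2, 0]

strict = accept_drafts(drafts, probs, tolerance=1)
loose = accept_drafts(drafts, probs, tolerance=2)
```

With strict matching only the first drafted token survives. With a tolerance of 2 the second is accepted as well, so more tokens are committed per forward pass of the base model. That is the intuition behind the reported speed-up; whether quality holds at a given tolerance is an empirical question the paper addresses.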

