AS, AI, SD | Oct 29, 2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

arXiv:2410.21951v2 | 20 citations | h-index: 13 | ICASSP

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for text-to-speech applications, representing an incremental improvement in a domain-specific area.

The paper tackles slow inference in auto-regressive text-to-speech systems by introducing VADUSA, which uses speculative decoding with draft heads to accelerate synthesis. The authors report significant speed gains together with improved performance, and a tolerance mechanism during sampling that adds further speed-up without quality loss.

The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
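
To make the draft-and-verify idea in the abstract concrete, here is a minimal toy sketch of one greedy speculative decoding step. It is not VADUSA's implementation: the names `speculative_step`, `top_tokens`, `base`, and `draft` are hypothetical, the two "models" are hand-written scoring functions over a 4-token vocabulary, and the `tolerance` parameter is only one possible reading of a tolerance mechanism (accept a drafted token if it is among the base model's top-`tolerance` candidates). The abstract does not specify the actual acceptance rule.

```python
from typing import Callable, List

Scores = List[float]
Model = Callable[[List[int]], Scores]  # prefix -> one score per vocabulary entry

VOCAB = 4  # toy vocabulary size (assumption for this sketch)


def top_tokens(scores: Scores, n: int) -> List[int]:
    """Return the indices of the n highest scores, best first."""
    return sorted(range(len(scores)), key=lambda t: -scores[t])[:n]


def speculative_step(prefix: List[int], base: Model, draft: Model,
                     num_draft: int, tolerance: int = 1) -> List[int]:
    """Draft num_draft tokens cheaply, then verify them with the base model.

    Hypothetical acceptance rule: a drafted token is kept if it is among the
    base model's top-`tolerance` tokens at that position. tolerance=1 is
    plain greedy verification.
    """
    # 1. Draft: propose tokens auto-regressively with the cheap predictor.
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(top_tokens(draft(prefix + drafted), 1)[0])

    # 2. Verify: a real system scores all positions in one batched forward
    #    pass; this sketch calls the base model position by position.
    accepted: List[int] = []
    for tok in drafted:
        allowed = top_tokens(base(prefix + accepted), tolerance)
        if tok not in allowed:
            accepted.append(allowed[0])  # replace with the base model's choice
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted: the base model adds one more token.
    accepted.append(top_tokens(base(prefix + accepted), 1)[0])
    return accepted


def base(prefix: List[int]) -> Scores:
    """Toy base model: prefers last+1, with last+2 as its second choice."""
    last = prefix[-1] if prefix else 0
    scores = [0.0] * VOCAB
    scores[(last + 1) % VOCAB] = 2.0
    scores[(last + 2) % VOCAB] = 1.0
    return scores


def draft(prefix: List[int]) -> Scores:
    """Toy draft predictor: agrees with base except at every third position."""
    last = prefix[-1] if prefix else 0
    step = 2 if len(prefix) % 3 == 0 else 1
    scores = [0.0] * VOCAB
    scores[(last + step) % VOCAB] = 1.0
    return scores


strict = speculative_step([0], base, draft, num_draft=3, tolerance=1)
relaxed = speculative_step([0], base, draft, num_draft=3, tolerance=2)
print(strict, relaxed)
```

In this toy setup the strict call rejects the third drafted token and returns three tokens, while the relaxed call accepts all three drafts and returns four. Each step always yields at least one token, so the relaxed rule can only lengthen the accepted run per base-model step. Whether that costs quality depends on the acceptance rule and the models, which is the trade-off the abstract says the tolerance mechanism handles.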
