SD · CL · AS · Feb 12, 2021

VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

arXiv:2102.06431v1 · 24 citations
Originality: Incremental advance
AI Analysis

This work addresses the inference speed bottleneck in text-to-speech synthesis for applications requiring real-time processing, representing an incremental improvement over existing non-autoregressive methods.

The paper tackles slow inference in text-to-speech synthesis by proposing VARA-TTS, a non-autoregressive model based on a very deep VAE with residual attention. It achieves an order-of-magnitude speed-up over an autoregressive counterpart at the cost of slightly lower speech quality.

This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for the residual attention module. By taking the coarse global alignment from the previous attention layer as an extra input, each subsequent attention layer can produce a refined version of the alignment. This spreads the burden of learning the textual-to-acoustic alignment across multiple attention layers and is more robust than using a single attention layer. An utterance-level speaking speed factor is computed by a jointly trained speaking speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine the number of acoustic frames at inference. Experimental results show that VARA-TTS achieves slightly lower speech quality than an AR counterpart, Tacotron 2, but an order-of-magnitude speed-up at inference, and that it outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.
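The layer-wise refinement described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes plain scaled dot-product attention, assumes the coarse alignment is upsampled by simple row repetition, and assumes the "extra input" enters as a log-prior added to the attention logits. The names `residual_attention`, `upsample_alignment`, and `bias_weight` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_alignment(alignment, factor):
    """Repeat each time-step row `factor` times (assumed nearest-neighbour upsampling)."""
    return [list(row) for row in alignment for _ in range(factor)]


def residual_attention(queries, keys, prev_alignment=None, bias_weight=1.0):
    """Scaled dot-product attention over text keys.

    If `prev_alignment` (same number of rows as `queries`) is given, its log is
    added to the logits, so this layer only has to refine the coarse alignment.
    This biasing scheme is an assumption, not the mechanism from the paper.
    """
    dim = len(keys[0])
    alignment = []
    for t, query in enumerate(queries):
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
        ]
        if prev_alignment is not None:
            logits = [
                logit + bias_weight * math.log(prior + 1e-8)
                for logit, prior in zip(logits, prev_alignment[t])
            ]
        alignment.append(softmax(logits))
    return alignment


# Toy data: 3 text positions, 2 coarse time steps, 4 fine time steps.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coarse_queries = [[2.0, 0.0], [0.0, 2.0]]
fine_queries = [[2.0, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 2.0]]

coarse = residual_attention(coarse_queries, keys)
fine = residual_attention(
    fine_queries, keys, prev_alignment=upsample_alignment(coarse, 2)
)
```

In this sketch the coarsest layer attends without a prior, and each finer layer starts from the upsampled alignment of the layer before it. A fine-layer query that carries no information (all zeros) simply reproduces the coarse alignment, which is the sense in which later layers only need to learn a refinement.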

