SD LG AS | Oct 24, 2021

Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

Marek Strong, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, Trevor Wood

arXiv:2110.12539v34.33 citations | h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient and high-quality speech synthesis for expressive task-oriented dialogues, representing an incremental improvement over existing VAE-based methods.

The paper tackles the problem of inefficient sampling in neural text-to-speech by proposing a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture, which achieves a statistically significant improvement in naturalness over VAE and VQ-VAE models and reduces the gap between constant vector synthesis and vocoded recordings by 32%.

We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for neural text-to-speech (NTTS), as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%.
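
The abstract does not spell out the quantizer itself, so the following is only a minimal sketch of the general idea of split vector quantization, not the authors' implementation. It assumes the utterance-level latent vector is cut into equal-width sub-vectors and that each sub-vector is snapped to the nearest entry of its own small codebook. The names `split_quantize`, `nearest_code`, the codebook values, and the dimensions are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(sub: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `sub` (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(sub, codebook[i]))


def split_quantize(
    z: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize `z` by splitting it into len(codebooks) equal parts.

    Assumption (not from the paper): one independent codebook per split,
    nearest-neighbour assignment within each split.
    """
    n_splits = len(codebooks)
    if len(z) % n_splits != 0:
        raise ValueError("latent size must be divisible by the number of splits")
    width = len(z) // n_splits

    indices: List[int] = []
    quantized: List[float] = []
    for s, codebook in enumerate(codebooks):
        sub = z[s * width:(s + 1) * width]   # s-th sub-vector of the latent
        idx = nearest_code(sub, codebook)    # discrete code for this split
        indices.append(idx)
        quantized.extend(codebook[idx])      # replace sub-vector by its code
    return indices, quantized


# Toy example: a 4-dimensional latent, 2 splits, 2 codes per split (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for split 0
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for split 1
]
z = [0.9, 0.8, 0.2, 0.7]
indices, z_q = split_quantize(z, codebooks)
print(indices, z_q)
```

Under these assumptions, S splits with K codes each can represent up to K^S distinct quantized vectors, while a text-side predictor only has to make S separate K-way choices. That is one plausible reading of how a discretized space can stay small enough for prediction from text without giving up representation power.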

