ASSD
Dec 7, 2020

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

arXiv:2012.03500v1
45 citations
AI Analysis

This work provides a more efficient and higher-quality text-to-speech solution for researchers and developers working on speech synthesis applications.

This paper introduces EfficientTTS, a non-autoregressive text-to-speech architecture that achieves high-quality speech synthesis efficiently. It outperforms Tacotron 2 and Glow-TTS in speech quality, training efficiency, and synthesis speed, while maintaining robustness and diversity.

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all its parameters with a stable, end-to-end training procedure, while synthesizing high-quality speech quickly and efficiently. EfficientTTS is motivated by a new monotonic alignment modeling approach (also introduced in this work), which imposes monotonic constraints on the sequence alignment with almost no additional computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-mel-spectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can be easily extended to autoregressive models such as Tacotron 2.

Code Implementations: 1 repo