AS · CL · SD · Mar 4, 2020

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

arXiv:2003.01950v1 · 63 citations
AI Analysis

This work addresses efficiency bottlenecks in text-to-speech synthesis for applications that require real-time or high-throughput speech generation.

The authors tackled the problem of slow text-to-speech synthesis by proposing AlignTTS, a feed-forward system that predicts mel-spectrograms in parallel without explicit alignment mechanisms. The model achieved state-of-the-art performance with a 0.03 higher mean opinion score than Transformer TTS and was over 50 times faster than real-time on the LJSpeech dataset.
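To make the "faster than real-time" figure concrete, the following minimal sketch shows how such a speed-up is usually computed from synthesis time and audio duration. The function names and the timing values are hypothetical illustrations, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio (lower is faster)."""
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds, audio_seconds):
    """How many times faster than playback speed the synthesis runs."""
    return audio_seconds / synthesis_seconds


# Hypothetical timing: 10 s of speech synthesized in 0.2 s of compute.
rtf = real_time_factor(0.2, 10.0)
speedup = speedup_over_real_time(0.2, 10.0)
print(f"real-time factor: {rtf:.3f}, speed-up: {speedup:.1f}x")
```

Under this definition, "more than 50 times faster than real-time" corresponds to a real-time factor below 0.02.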

Targeting both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer which generates the mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor. Instead of adopting the attention mechanism in Transformer TTS to align text to mel-spectrum, an alignment loss is presented that considers all possible alignments in training by use of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also high efficiency, running more than 50 times faster than real-time.
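The idea of summing over all possible alignments with dynamic programming can be illustrated with a forward-algorithm sketch. This is not the authors' implementation: it assumes monotonic alignments in which every character covers at least one frame, uniform transitions, and emission log-probabilities that are already given (the abstract does not say how AlignTTS computes them). The names `log_add` and `alignment_log_likelihood` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def alignment_log_likelihood(log_emit):
    """Log-likelihood summed over all monotonic text-to-frame alignments.

    log_emit[i][t] is the (assumed given) log-probability that character i
    emits mel frame t. An alignment starts at character 0 on frame 0, ends at
    the last character on the last frame, and at each frame either stays on
    the same character or advances to the next one.
    """
    n_chars = len(log_emit)
    n_frames = len(log_emit[0])
    if n_frames < n_chars:
        raise ValueError("need at least one frame per character")

    # alpha[i]: log-probability of all partial alignments that place
    # character i on the current frame.
    alpha = [NEG_INF] * n_chars
    alpha[0] = log_emit[0][0]
    for t in range(1, n_frames):
        prev = alpha
        alpha = [NEG_INF] * n_chars
        for i in range(n_chars):
            stay = prev[i]                             # same character continues
            move = prev[i - 1] if i > 0 else NEG_INF   # advance from previous character
            total = log_add(stay, move)
            if total != NEG_INF:
                alpha[i] = log_emit[i][t] + total
    return alpha[-1]


# With every emission probability set to 1 (log 0.0), the result counts
# alignments: 3 characters over 5 frames can be segmented in 6 ways.
uniform = [[0.0] * 5 for _ in range(3)]
print(round(math.exp(alignment_log_likelihood(uniform))))  # 6
```

A training loss in this style would be the negative of the returned value, so that minimizing it raises the total probability of all valid alignments rather than committing to a single one.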

Code Implementations: 2 repos
