AS CL SD Mar 4, 2020

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, Jing Xiao

arXiv:2003.01950v1 17.363 citations

Originality Highly original

AI Analysis

This work addresses efficiency bottlenecks in text-to-speech synthesis, which matter for applications that require real-time or high-throughput speech generation.

The authors tackled the problem of slow text-to-speech synthesis by proposing AlignTTS, a feed-forward system that predicts mel-spectrograms in parallel without explicit alignment mechanisms. The model achieved state-of-the-art performance with a 0.03 higher mean opinion score than Transformer TTS and was over 50 times faster than real-time on the LJSpeech dataset.

Targeting both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer which generates the mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor. Instead of adopting the attention mechanism in Transformer TTS to align text to the mel-spectrum, an alignment loss is presented that considers all possible alignments in training by use of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also high efficiency, running more than 50 times faster than real-time.
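The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the array layout, and the monotonic "stay or advance by one character" rule are assumptions made here for illustration. `length_regulate` repeats each character's hidden state according to its duration so that all frames can be generated in parallel, and `alignment_log_likelihood` uses a forward-style dynamic program in log-space to sum over every monotonic character-to-frame alignment. Its negative could serve as an alignment loss.

```python
import numpy as np


def length_regulate(char_states, durations):
    """Expand per-character states to frame level (hypothetical helper).

    char_states: array of shape (n_chars, hidden)
    durations:   integer array of shape (n_chars,), frames per character
    """
    return np.repeat(char_states, durations, axis=0)


def alignment_log_likelihood(log_probs):
    """Sum over all monotonic alignments with dynamic programming.

    log_probs[i, t] is an assumed log-score of frame t under character i,
    shape (n_chars, n_frames). An alignment starts at the first character,
    ends at the last one, and at each frame either stays on the current
    character or advances to the next (an assumption of this sketch).
    """
    n_chars, n_frames = log_probs.shape
    if n_frames < n_chars:
        raise ValueError("need at least one frame per character")

    # alpha[i, t]: log of the total score of all partial alignments
    # that assign frame t to character i.
    alpha = np.full((n_chars, n_frames), -np.inf)
    alpha[0, 0] = log_probs[0, 0]
    for t in range(1, n_frames):
        stay = alpha[:, t - 1]
        advance = np.concatenate(([-np.inf], alpha[:-1, t - 1]))
        alpha[:, t] = np.logaddexp(stay, advance) + log_probs[:, t]
    return alpha[-1, -1]


def alignment_loss(log_probs):
    """Negative log-likelihood over all alignments (illustrative only)."""
    return -alignment_log_likelihood(log_probs)
```

With uniform scores (`log_probs` all zero) the function returns the log of the number of valid alignments, which is a convenient sanity check. The dynamic program costs O(n_chars × n_frames) per utterance, whereas enumerating alignments explicitly would grow combinatorially.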
