AS, CL, LG, SD, ML · Dec 12, 2018

FPETS: Fully Parallel End-to-End Text-to-Speech System

arXiv:1812.05710v51 citations · Has Code
Originality: Highly original
AI Analysis

This paper addresses slow inference and synthesis errors in TTS for applications that require real-time synthesis, and it represents a novel advance rather than an incremental improvement.

The authors tackle the high latency and error modes of end-to-end text-to-speech systems by proposing FPETS, a fully parallel, non-autoregressive system that achieves speedups of up to 600x over state-of-the-art systems (600x over Tacotron2, 50x over DCTTS, 10x over Deep Voice3) while maintaining or improving audio quality.

End-to-end text-to-speech (TTS) systems can greatly improve the quality of synthesised speech, but they usually suffer from high latency due to their autoregressive structure. The synthesised speech may also exhibit error modes such as repeated words, mispronunciations, and skipped words. In this paper, we propose a novel non-autoregressive, fully parallel end-to-end TTS system (FPETS). It uses a new alignment model and the recently proposed U-shaped convolutional structure, UFANS. Unlike an RNN, UFANS can capture long-term information in a fully parallel manner. Trainable position encoding and a two-step training strategy are used to learn better alignments. Experimental results show that FPETS exploits parallel computation and achieves a significant inference speedup over state-of-the-art end-to-end TTS systems. More specifically, FPETS is 600X faster than Tacotron2, 50X faster than DCTTS, and 10X faster than Deep Voice3. FPETS can also generate audio of equal or better quality and with fewer errors than the other systems. As far as we know, FPETS is the first fully parallel end-to-end TTS system.
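
The link the abstract draws between autoregressive structure and latency can be illustrated with a toy sketch. This is not the FPETS architecture: there is no UFANS, alignment model, or position encoding here, and every name (`autoregressive_decode`, `parallel_decode`, `toy_step`, `toy_frame`) is a hypothetical placeholder. The sketch only shows the dependency pattern. In the autoregressive case each output frame needs the previous frame, while in the non-autoregressive case each frame depends only on its own input, so all frames could in principle be computed at once.

```python
from typing import Callable, List


def autoregressive_decode(
    encodings: List[float], step: Callable[[float, float], float]
) -> List[float]:
    """Frame t needs frame t-1, so frames must be produced one after another."""
    frames: List[float] = []
    prev = 0.0  # hypothetical initial frame (assumption for this toy example)
    for enc in encodings:
        prev = step(enc, prev)  # cannot start until the previous frame exists
        frames.append(prev)
    return frames


def parallel_decode(
    encodings: List[float], frame_fn: Callable[[float], float]
) -> List[float]:
    """Frame t depends only on its own input, so every call is independent.

    The comprehension still runs sequentially in plain Python; the point is
    that no iteration reads the result of another, so the work could be
    dispatched to parallel hardware in a single step.
    """
    return [frame_fn(enc) for enc in encodings]


def toy_step(enc: float, prev: float) -> float:
    # Hypothetical stand-in for an autoregressive decoder step.
    return 0.5 * enc + 0.5 * prev


def toy_frame(enc: float) -> float:
    # Hypothetical stand-in for a non-autoregressive decoder.
    return 2.0 * enc


encodings = [1.0, 2.0, 3.0]
ar_frames = autoregressive_decode(encodings, toy_step)
par_frames = parallel_decode(encodings, toy_frame)

# Independence check: processing the inputs in a different order and then
# restoring the order gives the same frames in the parallel case.
restored = parallel_decode(encodings[::-1], toy_frame)[::-1]
```

Because `parallel_decode` has no dependency between frames, its sequential depth does not grow with the output length, whereas `autoregressive_decode` needs one dependent step per frame. The speedup figures quoted in the abstract come from the authors' experiments, not from this sketch.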
