AS · SD · Nov 2, 2020

FeatherTTS: Robust and Efficient attention based Neural TTS

arXiv:2011.00935v1 · 5 citations
AI Analysis

This work addresses stability and speed problems in industrial TTS applications, though it appears incremental because it builds on existing attention-based methods.

The authors tackle robustness and efficiency issues in attention-based neural TTS by proposing FeatherTTS, which nearly eliminates word skipping and repeating on hard texts, generates acoustic features 3.5 times faster than Tacotron, and runs 35x faster than real time on a single CPU.

Attention-based neural TTS is an elegant speech synthesis pipeline and has shown a powerful ability to generate natural speech. However, it is still not robust enough to meet the stability requirements of industrial products. In addition, it suffers from slow inference owing to the autoregressive generation process. In this work, we propose FeatherTTS, a robust and efficient attention-based neural TTS system. First, we propose a novel Gaussian attention that exploits the interpretability of Gaussian attention and the strictly monotonic alignment property of TTS. With this method, we replace the commonly used stop token prediction architecture with attentive stop prediction. Second, we apply block sparsity to the autoregressive decoder to speed up speech synthesis. The experimental results show that the proposed FeatherTTS not only nearly eliminates word skipping and repeating on particularly hard texts while keeping the naturalness of the generated speech, but also speeds up acoustic feature generation by 3.5 times over Tacotron. Overall, the proposed FeatherTTS can be 35x faster than real time on a single CPU.
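The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the softplus parameterization, the normalization of the weights, the stop rule (stop once the attention mean passes the last input position), and the magnitude-based block pruning are all assumptions, and every function name below is hypothetical.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus; the result is always positive.
    return np.logaddexp(0.0, x)


def gaussian_attention_step(prev_mean, delta_raw, sigma_raw, num_inputs):
    """One decoder step of a monotonic Gaussian attention (illustrative).

    ASSUMPTION: the mean advances by softplus(delta_raw) > 0, so the
    alignment can only move forward over the input positions.
    """
    mean = prev_mean + softplus(delta_raw)
    sigma = softplus(sigma_raw) + 1e-3  # keep the width strictly positive
    positions = np.arange(num_inputs)
    scores = np.exp(-0.5 * ((positions - mean) / sigma) ** 2)
    weights = scores / (scores.sum() + 1e-8)  # ASSUMPTION: normalized weights
    return mean, weights


def attentive_stop(mean, num_inputs):
    # ASSUMPTION: stop when the attention centre has passed the last input,
    # instead of relying on a separately predicted stop token.
    return mean > num_inputs - 1


def decode_alignments(delta_raws, sigma_raw, num_inputs):
    # Runs attention steps until the attentive stop rule fires.
    mean, rows = 0.0, []
    for delta_raw in delta_raws:
        mean, weights = gaussian_attention_step(mean, delta_raw, sigma_raw, num_inputs)
        rows.append(weights)
        if attentive_stop(mean, num_inputs):
            break
    return np.array(rows)


def block_sparsify(weight, block=(4, 4), sparsity=0.5):
    """Zero the lowest-magnitude blocks of a 2-D weight matrix.

    ASSUMPTION: blocks are ranked by mean absolute value; with ties,
    slightly more than the requested fraction may be zeroed.
    """
    rows, cols = weight.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must be divisible by block"
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    norms = np.abs(blocks).mean(axis=(1, 3))
    k = int(sparsity * norms.size)
    if k == 0:
        return weight.copy()
    threshold = np.sort(norms, axis=None)[k - 1]
    mask = norms > threshold
    return (blocks * mask[:, None, :, None]).reshape(rows, cols)
```

Because the mean can only increase, an alignment produced this way cannot jump backwards (repeating) and moves in bounded forward steps, which is the intuition behind the robustness claim. The block-sparse matrix only pays off in speed when the inference kernel skips the zeroed blocks, which this sketch does not attempt.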

