AS · CL · LG · SD · Aug 9, 2020

SpeedySpeech: Efficient Neural Speech Synthesis

arXiv:2008.03802v1 · 51 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, high-quality speech synthesis, particularly in applications with limited computational resources. It is an incremental advance because it builds on existing neural sequence-to-sequence models.

The paper aims to achieve fast training, fast inference, and high-quality audio synthesis at the same time in neural speech synthesis. It proposes a student-teacher network built from simple convolutional blocks with residual connections, with a single attention layer in the teacher model. Coupled with a MelGAN vocoder, the system was rated significantly higher in voice quality than Tacotron 2, and it runs in real time on a CPU.

While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio samples in our GitHub repository.
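The "simple convolutional blocks with residual connections" mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `conv1d_same` and `residual_block`, the ReLU activation, the zero padding, and the omission of any normalization layer are all assumptions made for illustration.

```python
# Illustrative sketch only; names, activation, and padding are assumptions,
# not taken from the SpeedySpeech source code.

def conv1d_same(x, weight, bias, dilation=1):
    """1D convolution over time with zero padding so the length is preserved.

    x:      list of T frames, each a list of C_in floats
    weight: [K][C_in][C_out] nested lists, K assumed odd
    bias:   list of C_out floats
    """
    T, K = len(x), len(weight)
    c_in, c_out = len(weight[0]), len(weight[0][0])
    half = (K - 1) // 2
    out = []
    for t in range(T):
        frame = list(bias)  # start each output frame from the bias
        for k in range(K):
            src = t + (k - half) * dilation  # input frame seen by kernel tap k
            if 0 <= src < T:  # positions outside the sequence count as zeros
                for i in range(c_in):
                    for o in range(c_out):
                        frame[o] += x[src][i] * weight[k][i][o]
        out.append(frame)
    return out


def residual_block(x, weight, bias, dilation=1):
    """Residual connection: y = x + ReLU(conv(x)). Requires C_in == C_out."""
    h = conv1d_same(x, weight, bias, dilation)
    return [[xi + max(0.0, hi) for xi, hi in zip(xf, hf)]
            for xf, hf in zip(x, h)]
```

With all-zero weights and bias the block reduces to the identity, which is the property that lets many such blocks be stacked without degrading the input signal.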

Code Implementations: 3 repos