AS · CL · LG · SD · Jun 8, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

arXiv:2006.04664v2 · 121 citations
Originality: Incremental advance
AI Analysis

This work improves multi-speaker text-to-speech synthesis for applications that need diverse, high-quality voices. It is an incremental advance over existing Transformer TTS methods.

The paper tackles the difficulty of learning text-to-speech alignment in Transformer-based multi-speaker TTS, which parallel computation makes harder and which noisy data and diverse speakers make harder still. MultiSpeech adds three components to address this: a diagonal constraint on encoder-decoder attention, layer normalization on the phoneme embedding, and a bottleneck in the decoder pre-net. The result is more robust, higher-quality synthesis than a naive Transformer TTS.

Transformer-based text to speech (TTS) models (e.g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron~\cite{shen2018natural}) due to their parallel computation in training and/or inference. However, parallel computation makes it harder to learn the alignment between text and speech in Transformer. This difficulty is further magnified in the multi-speaker scenario with noisy data and diverse speakers, and it hinders the applicability of Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on the phoneme embedding in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better-quality multi-speaker voice than naive Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
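
To make the first technique more concrete, here is a minimal sketch of one way a diagonal constraint on an encoder-decoder attention matrix could be expressed as a training loss. The abstract does not give the formulation, so everything here is an assumption for illustration: the function names, the `band_width` parameter, and the choice to measure the fraction of attention mass that falls inside a band around the diagonal. It is not the authors' implementation.

```python
import numpy as np


def diagonal_band_mask(text_len, mel_len, band_width):
    """Boolean mask of shape (mel_len, text_len), True near the diagonal.

    Hypothetical helper: the expected alignment is taken to be the straight
    line from (frame 0, token 0) with slope text_len / mel_len.
    """
    slope = text_len / mel_len
    frames = np.arange(mel_len)[:, None]   # decoder (speech frame) positions
    tokens = np.arange(text_len)[None, :]  # encoder (phoneme) positions
    return np.abs(tokens - slope * frames) <= band_width


def diagonal_attention_rate(attn, band_width):
    """Average fraction of attention mass inside the diagonal band.

    attn has shape (mel_len, text_len) and each row is assumed to sum to 1.
    """
    mel_len, text_len = attn.shape
    mask = diagonal_band_mask(text_len, mel_len, band_width)
    return float((attn * mask).sum() / mel_len)


def diagonal_loss(attn, band_width):
    """Assumed loss term: maximizing the in-band rate means minimizing its negative."""
    return -diagonal_attention_rate(attn, band_width)


# Toy comparison: a perfectly monotonic alignment versus uniform attention.
mel_len, text_len = 8, 4
diagonal_attn = np.zeros((mel_len, text_len))
for frame in range(mel_len):
    diagonal_attn[frame, frame * text_len // mel_len] = 1.0
uniform_attn = np.full((mel_len, text_len), 1.0 / text_len)

rate_diagonal = diagonal_attention_rate(diagonal_attn, band_width=1)
rate_uniform = diagonal_attention_rate(uniform_attn, band_width=1)
```

In this toy setup the monotonic alignment keeps all of its mass inside the band while uniform attention does not, so adding `diagonal_loss` to a training objective would push attention toward the diagonal. The abstract also applies the constraint at inference; the same band mask could in principle be used to restrict attention scores there, but that too is an assumption.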

Code Implementations: 1 repo
