AS CL LG SD Jun 8, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

arXiv:2006.04664v221.8121 citations h-index: 75

Originality: Incremental advance

AI Analysis

This work addresses the problem of improving multi-speaker text-to-speech synthesis for applications requiring diverse and high-quality voice generation, representing an incremental advancement over existing Transformer TTS methods.

The paper tackles the difficulty of learning text-to-speech alignment in Transformer-based multi-speaker TTS, which is made harder by parallel computation and by noisy data from diverse speakers. It develops MultiSpeech, which adds a diagonal constraint on encoder-decoder attention, layer normalization on the phoneme embedding, and a bottleneck in the decoder pre-net, and which synthesizes more robust, higher-quality speech than a naive Transformer TTS.

Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron~\cite{shen2018natural}) due to their parallel computation in training and/or inference. However, parallel computation makes it harder to learn the alignment between text and speech in Transformer. This difficulty is further magnified in the multi-speaker scenario, with its noisy data and diverse speakers, and hinders the applicability of Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on the phoneme embedding in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better-quality multi-speaker voice than naive Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
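To make the first component more concrete, here is a minimal sketch of what a diagonal constraint on an attention weight matrix could look like. The abstract does not give the formulation, so everything here is an assumption for illustration: the function names `diagonal_band_mass` and `diagonal_penalty`, the `bandwidth` parameter, and the choice to score attention by the share of its mass that falls inside a band around the diagonal. It is not the paper's implementation.

```python
def diagonal_band_mass(attn, bandwidth):
    """Share of attention mass that lies in a band around the diagonal.

    attn: list of rows, one per speech frame; each row holds weights over
          text tokens and is assumed to sum to 1.
    bandwidth: hypothetical half-width of the band, in text tokens.
    """
    n_frames = len(attn)
    n_tokens = len(attn[0])
    # Slope that maps a frame index to its expected token position.
    slope = n_tokens / n_frames
    inside = 0.0
    for frame, row in enumerate(attn):
        center = slope * frame
        for token, weight in enumerate(row):
            if abs(token - center) <= bandwidth:
                inside += weight
    # Each row sums to 1, so dividing by the frame count gives a share in [0, 1].
    return inside / n_frames


def diagonal_penalty(attn, bandwidth):
    # Illustrative penalty: 0 when all attention mass is inside the band.
    return 1.0 - diagonal_band_mass(attn, bandwidth)


# Toy 4x4 alignments: one monotonic (diagonal), one reversed (anti-diagonal).
diagonal = [[1.0 if t == s else 0.0 for t in range(4)] for s in range(4)]
reversed_alignment = [[1.0 if t == 3 - s else 0.0 for t in range(4)] for s in range(4)]

print(diagonal_penalty(diagonal, 0.5))
print(diagonal_penalty(reversed_alignment, 0.5))
```

In this sketch a monotonic alignment incurs no penalty, while an alignment far from the diagonal is penalized heavily. A term like this could be added to a training loss to push encoder-decoder attention toward the roughly monotonic alignment that speech requires.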
