AS CL LG SD Aug 30, 2019

Maximizing Mutual Information for Tacotron

Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu

arXiv:1909.01145v2 15.522 citations Has Code

Originality: Incremental advance

AI Analysis

This paper addresses synthesis errors in end-to-end speech synthesis systems, but the advance is incremental because it builds on existing conditional autoregressive models.

The paper tackles synthesis errors such as missing or repeated words in end-to-end speech synthesis. It proposes maximizing the mutual information between the text condition and the acoustic features, which the authors report reduces the utterance error rate.

End-to-end speech synthesis methods already achieve close-to-human quality. However, compared to HMM-based and NN-based frame-to-frame regression methods, they are prone to synthesis errors such as missing or repeated words and incomplete synthesis. We attribute the comparatively high utterance error rate to two causes: the local information preference of conditional autoregressive (CAR) models, and an ill-posed training objective, which mostly describes the training status of the autoregressive module but rarely that of the condition module. Inspired by InfoGAN, we propose to maximize the mutual information between the text condition and the predicted acoustic features, strengthening the dependency between them in CAR speech synthesis models. This should alleviate the local information preference issue and reduce the utterance error rate. The mutual-information training objective can also be regarded as a metric of the dependency between the autoregressive module and the condition module. Experimental results show that our method can reduce the utterance error rate.
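The abstract does not spell out the objective, so the following is only a toy sketch of the InfoGAN-style idea it refers to: mutual information I(c; x) is bounded from below by H(c) + E[log q(c|x)], where q is an auxiliary model that tries to recover the condition c from the output x. Everything concrete here is an assumption for illustration: the binary variables, the joint distribution `P_JOINT`, the imperfect recognizer `Q_ROUGH`, and all function names. In a Tacotron-like setting c would stand for the text and x for the predicted acoustic features, but the paper's actual recognizer and loss are not described in this listing.

```python
import math

# Toy joint distribution p(c, x) over a binary "text condition" c and a
# binary "acoustic feature" x. The numbers are illustrative only.
P_JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def marginal(joint, axis):
    """Sum out the other variable: axis 0 gives p(c), axis 1 gives p(x)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def mutual_information(joint):
    """Exact I(c; x) in nats."""
    p_c, p_x = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * math.log(p / (p_c[c] * p_x[x]))
        for (c, x), p in joint.items()
        if p > 0
    )


def variational_lower_bound(joint, q):
    """H(c) + E_{p(c,x)}[log q(c|x)], where q[(c, x)] approximates p(c|x)."""
    p_c = marginal(joint, 0)
    entropy_c = -sum(p * math.log(p) for p in p_c.values() if p > 0)
    expected_log_q = sum(
        p * math.log(q[(c, x)]) for (c, x), p in joint.items() if p > 0
    )
    return entropy_c + expected_log_q


def true_posterior(joint):
    """Exact p(c|x), the recognizer that makes the bound tight."""
    p_x = marginal(joint, 1)
    return {(c, x): p / p_x[x] for (c, x), p in joint.items()}


# A hypothetical, imperfect recognizer q(c|x), keyed by (c, x).
Q_ROUGH = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.4, (1, 1): 0.6}

MI = mutual_information(P_JOINT)
BOUND_ROUGH = variational_lower_bound(P_JOINT, Q_ROUGH)
BOUND_TIGHT = variational_lower_bound(P_JOINT, true_posterior(P_JOINT))

print(f"exact MI:            {MI:.4f} nats")
print(f"bound (rough q):     {BOUND_ROUGH:.4f} nats")
print(f"bound (posterior q): {BOUND_TIGHT:.4f} nats")
```

Since H(c) does not depend on the model, raising this bound amounts to raising E[log q(c|x)], that is, making the condition easier to recover from the output. The bound never exceeds the exact mutual information and meets it when q equals the true posterior.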
