CL, SD, AS | Nov 7, 2019

Teacher-Student Training for Robust Tacotron-based TTS

arXiv:1911.02839v2 | 40 citations
Originality: Incremental advance
AI Analysis

This work addresses robustness issues in text-to-speech systems, which matters to users whose input is diverse or differs from the training data. The contribution is incremental, since it builds on existing Tacotron2 methods.

The paper tackles the exposure bias problem in autoregressive neural TTS models, which causes unpredictable performance on out-of-domain data. It proposes a teacher-student training scheme with a distillation loss and reports consistent voice quality improvements on out-of-domain test data for both Chinese and English systems.

While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains unresolved. Exposure bias arises from the mismatch between the training and inference processes, which results in unpredictable performance on out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS that introduces a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher. We then train another Tacotron2-based model as the student, whose decoder takes the predicted speech frames as input, just as the decoder does during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, a process known as knowledge distillation. Experiments show that the proposed training scheme consistently improves voice quality on out-of-domain test data in both Chinese and English systems.
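
The difference between the two decoding modes and the combined loss can be illustrated with a minimal sketch. This is not the paper's implementation: the decoder is reduced to a hypothetical per-step function `step(prev_frame, t)`, both loss terms are assumed to be mean squared errors, and the names `decode`, `student_loss`, and `weight` are invented for illustration.

```python
from typing import Callable, List

Frame = List[float]
StepFn = Callable[[Frame, int], Frame]  # hypothetical decoder step


def mse(a: List[Frame], b: List[Frame]) -> float:
    """Mean squared error over all frames and dimensions."""
    total, count = 0.0, 0
    for fa, fb in zip(a, b):
        for x, y in zip(fa, fb):
            total += (x - y) ** 2
            count += 1
    return total / count


def decode(step: StepFn, natural: List[Frame], teacher_forcing: bool) -> List[Frame]:
    """Run the autoregressive decoder over len(natural) steps.

    teacher_forcing=True  feeds the natural frame back in (teacher model).
    teacher_forcing=False feeds the model's own prediction back in
    (student model, same as run-time inference).
    """
    prev = [0.0] * len(natural[0])  # all-zero start frame (assumption)
    outputs = []
    for t in range(len(natural)):
        pred = step(prev, t)
        outputs.append(pred)
        prev = natural[t] if teacher_forcing else pred
    return outputs


def student_loss(student_step: StepFn, teacher_step: StepFn,
                 natural: List[Frame], weight: float = 1.0) -> float:
    """Feature loss plus a weighted distillation loss (both assumed MSE)."""
    teacher_out = decode(teacher_step, natural, teacher_forcing=True)
    student_out = decode(student_step, natural, teacher_forcing=False)
    feature_loss = mse(student_out, natural)           # match natural speech
    distillation_loss = mse(student_out, teacher_out)  # match the teacher
    return feature_loss + weight * distillation_loss


# Toy usage: a one-dimensional "frame" and a shared toy step function.
toy_step = lambda prev, t: [0.5 * x + 1.0 for x in prev]
natural_frames = [[2.0], [3.0]]
loss = student_loss(toy_step, toy_step, natural_frames)
```

Even with identical step functions, the teacher-forced and free-running outputs diverge after the first frame, which is the train/inference mismatch the distillation term is meant to reduce.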
