SDAIAS, Oct 25, 2022

Semi-Supervised Learning Based on Reference Model for Low-resource TTS

arXiv:2210.14723v1, 6 citations, h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of achieving high-quality TTS in low-resource settings. The advance is incremental, as it builds on existing methods such as Fastspeech2.

The paper tackles the problem of low-resource text-to-speech (TTS) by proposing a semi-supervised learning method that uses a pre-trained reference model and pseudo labels to improve performance with limited target data, achieving significant improvements in voice quality, naturalness, and robustness.

Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in settings where labeled target data is limited; the method also resolves the exposure bias problem of previous auto-regressive models. Specifically, we pre-train a reference model based on Fastspeech2 on a large amount of source data and then fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model further guide the fine-tuned model's training, providing a regularization effect that reduces overfitting on the limited target data. Experimental results show that the proposed semi-supervised learning scheme, trained with limited target data, significantly improves voice quality on test data and yields natural, robust speech synthesis.
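
The training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption: the L1 distance, the `pseudo_weight` factor, and the function names are hypothetical, and the "models" are plain Python callables standing in for the frozen reference model and the fine-tuned model. The idea shown is only that a supervised term on the limited labeled target data is combined with a term that pulls the fine-tuned model toward the reference model's pseudo labels.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical stand-in for a TTS model: maps input features to acoustic features.
Model = Callable[[Vector], List[float]]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mean(values: List[float]) -> float:
    """Average of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def semi_supervised_loss(
    finetuned: Model,
    reference: Model,
    labeled: Sequence[Tuple[Vector, Vector]],
    unlabeled: Sequence[Vector],
    pseudo_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Supervised loss on labeled data plus a weighted pseudo-label loss.

    The reference model is treated as frozen: its outputs on unlabeled
    inputs serve as pseudo labels that regularize the fine-tuned model.
    """
    supervised = mean([l1_loss(finetuned(x), y) for x, y in labeled])
    pseudo = mean([l1_loss(finetuned(x), reference(x)) for x in unlabeled])
    return supervised + pseudo_weight * pseudo


# Toy usage: the fine-tuned model is offset by 1.0 from the reference model.
reference_model: Model = lambda x: [2.0 * v for v in x]
finetuned_model: Model = lambda x: [2.0 * v + 1.0 for v in x]

loss = semi_supervised_loss(
    finetuned_model,
    reference_model,
    labeled=[([1.0, 2.0], [2.0, 4.0])],
    unlabeled=[[0.0, 1.0]],
)
print(loss)  # 1.5 (supervised 1.0 + 0.5 * pseudo 1.0)
```

In a real setup the two callables would be neural networks and the loss would be minimized by gradient descent over the fine-tuned model's parameters only; the sketch shows just how the two terms combine.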

