CL, SD, AS | Nov 15, 2021

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

arXiv:2111.07549v1 | 3 citations
Originality: Incremental advance
AI Analysis

This work addresses unnatural prosody in speech synthesis for unseen texts, an incremental improvement to text-to-speech systems.

The paper tackles unnatural prosody in speech synthesis for unseen texts by combining a fine-tuned BERT-based front-end with a pre-trained FastSpeech 2-based acoustic model. Both components improve prosody, especially for structurally complex sentences.

Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and the prosody of speech synthesized for unseen texts remains relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech 2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned in a multi-task learning framework on three tasks: polyphone disambiguation, joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging, and prosody structure prediction (PSP). FastSpeech 2 is pre-trained on large-scale external data that is noisy but easier to obtain. Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 improve prosody, especially for structurally complex sentences.
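To make the multi-task fine-tuning idea concrete, here is a minimal sketch of how three per-token tasks could be trained against one set of shared encoder features. It is not the paper's implementation. The abstract does not say how the task losses are combined, so the weighted sum, the equal weights, the label-set sizes, and every name below (`LinearHead`, `task_loss`, `multitask_loss`) are assumptions made for illustration. Random vectors stand in for BERT outputs so the example runs with the standard library alone.

```python
import math
import random


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one token's logits against an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


class LinearHead:
    """Hypothetical task-specific per-token classifier over shared features."""

    def __init__(self, hidden, n_labels, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
                  for _ in range(n_labels)]

    def logits(self, feature):
        return [sum(wi * fi for wi, fi in zip(row, feature)) for row in self.w]


def task_loss(head, features, labels):
    """Mean cross-entropy over tokens that carry a label (None = unlabeled)."""
    losses = [softmax_cross_entropy(head.logits(f), y)
              for f, y in zip(features, labels) if y is not None]
    return sum(losses) / len(losses) if losses else 0.0


def multitask_loss(heads, features, labels_by_task, weights):
    """Weighted sum of per-task losses (the weighting scheme is an assumption)."""
    per_task = {t: task_loss(heads[t], features, labels_by_task[t])
                for t in heads}
    total = sum(weights[t] * per_task[t] for t in heads)
    return total, per_task


rng = random.Random(0)
HIDDEN = 8  # toy size; a real BERT hidden size is much larger

# Stand-in for shared encoder outputs of a 5-character sentence.
features = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(5)]

# One head per task; label-set sizes are made up for the example.
heads = {
    "polyphone": LinearHead(HIDDEN, 4, rng),
    "cws_pos": LinearHead(HIDDEN, 6, rng),
    "psp": LinearHead(HIDDEN, 3, rng),
}

# Only polyphonic characters carry a polyphone label; others are None.
labels_by_task = {
    "polyphone": [None, 2, None, None, 0],
    "cws_pos": [0, 1, 3, 3, 5],
    "psp": [0, 0, 1, 0, 2],
}
weights = {"polyphone": 1.0, "cws_pos": 1.0, "psp": 1.0}

total, per_task = multitask_loss(heads, features, labels_by_task, weights)
print(per_task, total)
```

In a real setup the gradient of `total` would flow back into the shared encoder, so each task can shape the representation used by the others. Masking unlabeled tokens with `None` lets tasks with sparse labels, such as polyphone disambiguation, share a batch with densely labeled ones.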
