ASSDNov 8, 2020

Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

arXiv:2011.03943v422 citations
AI Analysis

This work addresses the problem of expressive speech synthesis for TTS applications, representing an incremental improvement with novel method elements.

The paper tackles fine-grained style modeling and transfer in text-to-speech synthesis by disentangling content and style at the phone level, achieving better content preservation compared to other models as shown in objective and subjective evaluations.

This paper presents a novel design of neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration https://daxintan-cuhk.github.io/pl-csd-speech .

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes