SD · CL · LG · AS · Aug 31, 2023

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

arXiv:2308.16593v1 · 6 citations · h-index: 52
Originality: Incremental advance
AI Analysis

This work aims to make conversational speech synthesis more human-like for applications such as AI assistants and communication tools. It is an incremental advance: it builds on existing methods and focuses on enlarging the data and labels available for training.

The paper tackles the challenge of synthesizing spontaneous-style speech in conversational text-to-speech. It proposes a semi-supervised pre-training method that increases the amount of spontaneous-style speech data and spontaneous behavior labels, and it reports superior expressive speech synthesis performance together with the ability to model and predict spontaneous behavior.

The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.

