CL, SD, AS | Aug 1, 2024

Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

arXiv:2408.00284v1 | 4 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the need for better Chinese dialectal speech synthesis, which matters for applications in regions with diverse dialects. The contribution appears incremental, since it builds on existing TTS methods.

The paper tackles the generation of high-quality Chinese dialectal speech, a task that current large-scale TTS models struggle with. It proposes Bailing-TTS, a family of models that, in the authors' experiments, generates dialectal speech approaching human-like spontaneous representation.

Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, the Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed design of novel network architecture and corresponding strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation. Readers are encouraged to listen to demos at https://c9412600.github.io/bltts_tech_report/index.html.

