CL · AI · Jan 29, 2025

BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

arXiv:2501.17790v1 · 8 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the specific challenge of accurate speech synthesis for Taiwanese Mandarin. It is an incremental advance, since it builds on existing methods.

The paper tackles polyphone disambiguation in Taiwanese Mandarin Text-to-Speech (TTS) by adapting the CosyVoice system, which combines components such as an LLM and OT-CFM, and reports superior performance in generating high-fidelity speech in both general and code-switching contexts.

We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
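
To make the polyphone problem concrete, here is a minimal sketch of why a character-by-character lookup is not enough: the same character takes different readings depending on the word it appears in. The lexicon, the function name `to_pinyin`, and the longest-match strategy are all illustrative assumptions. BreezyVoice uses a learned grapheme-to-phoneme prediction model, which this toy lookup does not reproduce.

```python
# Toy illustration only: a hypothetical hand-written lexicon,
# not BreezyVoice's grapheme-to-phoneme model.

# Default (most common) reading for each character.
DEFAULT_READING = {
    "行": "xíng", "銀": "yín", "長": "cháng",
    "大": "dà", "走": "zǒu",
}

# Word-level entries override the per-character defaults.
WORD_READINGS = {
    "銀行": ["yín", "háng"],   # 行 read as háng in "bank"
    "長大": ["zhǎng", "dà"],   # 長 read as zhǎng in "grow up"
}

MAX_WORD_LEN = max(len(word) for word in WORD_READINGS)


def to_pinyin(text):
    """Return one reading per character, preferring the longest known word."""
    readings = []
    i = 0
    while i < len(text):
        # Try multi-character words first, longest candidate first.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                readings.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No word matched: fall back to the per-character default,
            # or pass the character through unchanged if it is unknown.
            ch = text[i]
            readings.append(DEFAULT_READING.get(ch, ch))
            i += 1
    return readings
```

With this lexicon, `to_pinyin("銀行")` and `to_pinyin("行走")` assign different readings to 行. A fixed lexicon cannot cover unseen contexts, which is the motivation for a predictive model.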

