CL · SD · AS · Apr 3, 2024

Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

arXiv:2404.02592v181 citations · h-index: 5 · LREC
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific issue in Korean TTS and improves speech quality in a low-resource language context, but it is an incremental advance because it builds on existing neural TTS methods.

The paper tackles pausing errors in Korean text-to-speech (TTS) systems, which degrade speech naturalness. It proposes a framework that models syntactic and acoustic cues for pause formation and reports improved performance on subjective and objective metrics.

Contemporary neural speech synthesis models generate synthetic speech with a quality comparable to that of human-produced speech. These results, however, have been verified predominantly for high-resource languages such as English. Moreover, the Tacotron and FastSpeech variants show substantial pausing errors when applied to Korean, which affects speech perception and naturalness. To address these issues, we propose a novel framework that comprehensively models both the syntactic and the acoustic cues associated with pausing patterns. Although it is trained on short audio clips, our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences. We validate the architectural design choices through comparisons with baseline models and ablation studies, using subjective and objective metrics.

