TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
The work addresses low-resource speech processing for Tibetan by enabling multi-dialect speech synthesis, though the contribution is incremental because it builds on existing TTS methods.
The paper tackles the scarcity of parallel speech corpora for Tibetan dialects by proposing TMD-TTS, a unified text-to-speech framework that synthesizes dialectal speech from dialect labels. In objective and subjective evaluations, it significantly outperforms baselines in dialectal expressiveness.
Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech through a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.
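The abstract names a dialect fusion module and a dynamic routing network but gives no architectural details. The sketch below only illustrates the general idea of dialect-conditioned routing, under stated assumptions: fusion is modeled as adding a per-dialect embedding to a hidden vector, and routing as a softmax gate that mixes a few linear "experts". The class `DialectRoutedLayer`, its parameters, and the random weights are hypothetical and are not taken from TMD-TTS or DSDR-Net.

```python
import math
import random

# Explicit dialect labels, as listed in the abstract.
DIALECTS = ["Ü-Tsang", "Amdo", "Kham"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DialectRoutedLayer:
    """Hypothetical layer: mixes expert projections with dialect-dependent weights."""

    def __init__(self, dim, num_experts=3, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Assumption: one embedding per dialect label stands in for "dialect fusion".
        self.dialect_embedding = {
            d: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for d in DIALECTS
        }
        # The gate maps a fused hidden vector to one score per expert.
        self.gate = random_matrix(num_experts, dim, rng)
        # Each expert is a plain dim x dim linear map.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def _fuse(self, hidden, dialect):
        """Add the dialect embedding to the hidden vector (assumed fusion rule)."""
        return [h + e for h, e in zip(hidden, self.dialect_embedding[dialect])]

    def route(self, hidden, dialect):
        """Return the routing weights (one per expert) for this input and dialect."""
        return softmax(matvec(self.gate, self._fuse(hidden, dialect)))

    def forward(self, hidden, dialect):
        """Return the gate-weighted sum of the expert outputs."""
        fused = self._fuse(hidden, dialect)
        weights = softmax(matvec(self.gate, fused))
        outputs = [matvec(expert, fused) for expert in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(self.dim)
        ]


layer = DialectRoutedLayer(dim=4)
hidden = [0.1, -0.2, 0.3, 0.05]
for dialect in DIALECTS:
    # Print how the same hidden vector is routed under each dialect label.
    print(dialect, [round(w, 3) for w in layer.route(hidden, dialect)])
```

Because the same hidden vector is routed under each label, a single text-side representation yields one output per dialect, which is the sense in which a label-conditioned model could produce parallel dialectal speech. A real system would learn these weights and operate on sequences of acoustic frames rather than a single vector.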