CLAS Jun 13, 2024

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

arXiv:2406.08911v1, 5 citations
Originality: Synthesis-oriented
AI Analysis

This is an incremental study of how to adapt text-to-speech (TTS) systems to low-resource languages.

This paper investigates language adaptation for TTS systems in low-resource scenarios. It finds that phonetic similarity, language category, dataset size, and number of speakers all affect performance, and that fine-tuning on audio-only data can outperform fine-tuning on paired data.

Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

Code Implementations: 1 repo
