CL LG SD AS
May 27, 2025

Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages

Praveen Srinivasa Varadhan, Srija Anand, Soma Siddhartha, Mitesh M. Khapra

arXiv:2505.20693v14.91 citations
h-index: 40

Originality: Incremental advance

AI Analysis

This paper addresses text-to-speech synthesis for low-resource languages, which benefits speakers and developers in multilingual regions. The advance is incremental because it builds on existing pretrained models.

The paper adapts the English F5-TTS model to low-resource Indian languages and finds that fine-tuning on Indian data alone is most effective. The resulting model achieves near-human polyglot fluency and supports zero-resource synthesis for unseen languages such as Bhojpuri and Tulu.

What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective, and the resultant IN-F5 is a near-human polyglot that enables speakers of one language (e.g., Odia) to fluently speak in another (e.g., Hindi). Our results show that English pretraining helps low-resource TTS reach human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute-optimal strategy. Finally, we show that IN-F5 can synthesize unseen languages like Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.
