RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
This work addresses the need for adaptable speech modeling in Indian languages with accents and intonations, providing a foundational resource and an open-source tool for multilingual expressive synthesis.
The authors tackle controllable and expressive text-to-speech synthesis for Indian languages by introducing RASMALAI, a large-scale dataset with 13,000 hours of speech and 24 million annotations, and by developing IndicParlerTTS, which demonstrates high-quality speech generation and reliable attribute synthesis.
We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations covering fine-grained attributes such as speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS system for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions, and accurately synthesize specified attributes. It also transfers expressive characteristics effectively both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.