ASLG · Jul 22, 2025

Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages

arXiv:2507.16875v1
Originality: Synthesis-oriented
AI Analysis

This work addresses speech generation challenges for low-resource Indian languages, offering incremental insights into duration prediction strategies for zero-shot, speaker-specific tasks.

The study tackled high-quality speech generation for low-resource Indian languages by evaluating duration prediction strategies in a non-autoregressive Continuous Normalizing Flow model, finding that infilling-based predictors improve intelligibility in some languages while speaker-prompted predictors better preserve speaker characteristics in others.

High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.
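As a rough illustration of why an explicit duration module is considered interpretable, the sketch below shows the generic length-regulation step used in many non-autoregressive speech pipelines: each phoneme is repeated for its predicted number of frames. This is an assumption about how predicted durations are typically consumed, not a description of the paper's model. The function name `expand_by_duration` and the toy phonemes and durations are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its predicted number of frames.

    Hypothetical helper: a generic length regulator, not the paper's code.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 drops the phoneme from the frame sequence.
        frames.extend([phoneme] * n_frames)
    return frames


# Toy input (made up for illustration): three phonemes, integer frame counts.
toy_phonemes = ["n", "a", "m"]
toy_durations = [2, 3, 1]
frame_sequence = expand_by_duration(toy_phonemes, toy_durations)
```

Under this view, the total length of the generated segment is just the sum of the predicted durations, so different predictors change the rhythm and length of the output in a way that can be inspected directly.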

