CL | Nov 7, 2024

Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages

Microsoft
arXiv:2411.04699v3 | 5 citations | h-index: 40 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of speech translation resources for Indian languages, which limits real-world applications. The contribution is incremental: it applies existing methods to newly collected data.

The authors tackled the scarcity of large-scale speech translation datasets for Indian languages by introducing BhasaAnuvaad, a dataset with over 44 thousand hours of audio and 17 million aligned text segments across 14 languages, and trained IndicSeamless, a model that outperforms existing ones in translation quality.

Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover only a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that performs better than existing models. Our experiments demonstrate improvements in translation quality, setting a new standard for Indian language speech translation. We will release all the code, data, and model weights as open source under permissive licenses to promote accessibility and collaboration.

Code Implementations: 1 repo
