CL | Mar 4, 2024

IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

arXiv:2403.01926v1 | 54 citations | h-index: 51 | Has Code | ACL
Originality: Incremental advance
AI Analysis

The dataset provides a foundational resource for speech technology in India and addresses diversity gaps for researchers and developers, though its data collection methods build incrementally on existing ones.

The authors tackled the lack of inclusive multilingual speech data for Indian languages by creating INDICVOICES, a dataset of 7348 hours of speech from 22 languages, and used it to build IndicASR, the first ASR model supporting all 22 scheduled languages.

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
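
The headline figures in the abstract can be turned into approximate absolute numbers with a little arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and because the abstract reports rounded percentages, the derived hours are approximations, not figures stated by the authors.

```python
# Figures taken from the abstract; all names here are illustrative.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
SPEAKERS = 16237
# Rounded percentage shares of each speech type, as reported in the abstract.
SHARES_PCT = {"read": 9, "extempore": 74, "conversational": 17}


def hours_by_type(total_hours, shares_pct):
    """Approximate hours per speech type from rounded percentage shares."""
    return {name: round(total_hours * pct / 100, 1) for name, pct in shares_pct.items()}


breakdown = hours_by_type(TOTAL_HOURS, SHARES_PCT)
transcribed_share = TRANSCRIBED_HOURS / TOTAL_HOURS
minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

for name, hours in breakdown.items():
    print(f"{name}: about {hours} hours")
print(f"transcribed so far: {transcribed_share:.1%} of all audio")
print(f"average audio per speaker: about {minutes_per_speaker:.1f} minutes")
```

Under these assumptions, roughly a fifth of the audio is transcribed, and each speaker contributes a little under half an hour on average.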

