AS | AI | CL | Feb 2

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

arXiv:2602.02734v1 | 3 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

WAXAL is a crucial resource for developing inclusive speech technologies and for preserving languages spoken by over 100 million people. The contribution is incremental, however, because it focuses on data collection rather than novel methods.

The authors tackled the digital divide in speech technology for Sub-Saharan African languages by introducing WAXAL, a large-scale multilingual speech corpus with approximately 1,250 hours of transcribed ASR data and over 180 hours of TTS recordings for 21 languages.

The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
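To put the headline figures in perspective, the following back-of-the-envelope sketch combines the numbers quoted in the abstract. The function name `corpus_summary` is hypothetical, the hour counts are the abstract's rounded values ("approximately" and "over"), and the per-language mean assumes an even split across the 21 languages, which the abstract does not claim; the paper's own statistical overview is the authoritative source for the real distribution.

```python
# Headline figures as quoted in the abstract (rounded, not exact).
ASR_HOURS = 1250       # "approximately 1,250 hours" of transcribed speech
TTS_HOURS = 180        # "over 180 hours" of single-speaker recordings
NUM_LANGUAGES = 21     # languages covered by the collection


def corpus_summary(asr_hours, tts_hours, num_languages):
    """Return rough scale indicators derived from the headline figures."""
    total = asr_hours + tts_hours
    return {
        "total_hours": total,
        "asr_share": asr_hours / total,
        # Assumption: hours are spread evenly across languages.
        # The real per-language distribution is likely uneven.
        "mean_asr_hours_per_language": asr_hours / num_languages,
    }


summary = corpus_summary(ASR_HOURS, TTS_HOURS, NUM_LANGUAGES)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions, the ASR component accounts for most of the recorded audio, and the per-language mean is only a rough yardstick for judging how much data a single language might have.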

