CL, Dec 22, 2025

Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly, Michael Leventhal

arXiv:2512.19400

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for robust ASR in low-resource languages like Bambara, though it is incremental as it applies existing methods to new data.

The authors address automatic speech recognition (ASR) for Bambara, a predominantly oral language, by compiling a 160-hour dataset from Malian radio archives that captures real-world conditions such as code-switching and background noise. Finetuning with this dataset reduced word error rate (WER) from 44.47% to 37.12% on one real-world test set and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperformed a comparable model trained on 98 hours of cleaner speech.

We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
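
For readers unfamiliar with the metric, the WER figures above can be put in context with a minimal sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words); the abstract does not say which scoring tool the authors used. The function names and the placeholder tokens are hypothetical and are not taken from the paper or its released code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and the standard WER definition;
    the paper's own scoring and normalization pipeline may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by finetuning."""
    return (before - after) / before


if __name__ == "__main__":
    # Placeholder tokens only: one substitution and one deletion over 4 words.
    print(word_error_rate("a b c d", "a x c"))

    # WER pairs (in percent) as reported in the abstract.
    for before, after in [(44.47, 37.12), (36.07, 32.33)]:
        print(before, after, round(100 * relative_reduction(before, after), 1))
```

Applied to the reported figures, the absolute reductions are 7.35 and 3.74 WER points, which correspond to relative reductions of roughly 16.5% and 10.4%.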
