CL · SD · AS · Feb 1, 2025

Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

arXiv:2502.00421v1 · 2 citations · h-index: 22 · Has Code · ICASSP
Originality: Synthesis-oriented
AI Analysis

This work addresses the shortage of language resources for Oromo speakers and researchers, but it is incremental: it applies existing methods to new data.

The authors tackled the lack of Automatic Speech Recognition (ASR) resources for the Oromo language by creating an open-source dataset of 100 hours of transcribed audio recordings. A fine-tuned Whisper model achieved a Word Error Rate (WER) of 10.82% on it.

We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the underrepresented Oromo language. To show its applicability to the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee, and we encourage its use for further research and development in Oromo speech processing.
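The WER figures quoted above follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the authors' evaluation script. The function name `word_error_rate` and the whitespace tokenization are assumptions made here, and the sample strings are placeholders rather than dataset content.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace,
    which is an assumption, not necessarily the paper's text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings: one substitution and one deletion over four words.
print(word_error_rate("a b c d", "a x c"))
```

Reported corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance scores.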

Code Implementations: 1 repo