AfriHuBERT: A self-supervised speech representation model for African languages
This work addresses the scarcity of speech representation models for many African languages, benefiting over 600 million African language speakers.
This paper introduces AfriHuBERT, a self-supervised speech representation model that extends mHuBERT-147 to cover 1,226 African languages. On the FLEURS benchmark, it improves over mHuBERT-147 by +3.6% F1 score on Spoken Language Identification and reduces average Word Error Rate by 2.1% on Automatic Speech Recognition.
In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources, benefiting an African population of over 600M. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR), using the FLEURS benchmark. Our results show a +3.6% F1 score improvement for SLID and a 2.1% reduction in average Word Error Rate (WER) for ASR over mHuBERT-147, and demonstrate that AfriHuBERT is competitive with larger SSL models such as MMS and XEUS. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization and are competitive in extremely low-resource ASR scenarios.
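For readers unfamiliar with the ASR metric reported above, the following is a minimal sketch of how Word Error Rate is commonly computed: the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this abstract does not state which scoring toolkit or text normalization was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, not taken from FLEURS.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

A lower WER is better, so a reduction relative to mHuBERT-147 indicates fewer word-level errors in the transcripts produced by the AfriHuBERT-based ASR models.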