CL · Dec 1, 2025

Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk

arXiv:2512.02201v1 · 2.71 citations · h-index: 14

Originality: Synthesis-oriented

AI Analysis

This dataset supports ASR development for underrepresented South African languages, but the contribution is incremental: it applies existing methods to new data.

The paper introduces Swivuriso, a 3000-hour multilingual speech dataset for seven South African languages, addressing gaps in ASR resources and providing baseline results for model training and comparison.

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
