CL SD AS | Sep 17, 2025

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti

CMU

arXiv:2509.14161v1 | INTERSPEECH

AI Analysis

This dataset addresses the problem of limited resources for developing code-switched speech systems, especially for lower-resourced languages, though it is incremental as it builds on existing data collection methods.

CS-FLEURS tackles the lack of diverse datasets for code-switched speech recognition and translation. It provides 4 test sets covering 113 unique code-switched language pairs across 52 languages, with both real and synthetic speech, plus 128 hours of synthetic training data.

We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
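The per-set counts in the abstract can be tabulated to see how they relate to the 113 unique pairs. The following is a minimal sketch: the `TestSet` class and the set labels are hypothetical names chosen for illustration, not the dataset's official split names, and the only figures used are those stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestSet:
    label: str   # hypothetical label, not an official split name
    pairs: int   # number of language pairs, as stated in the abstract
    speech: str  # how the audio was produced


TEST_SETS = [
    TestSet("X-English (read)", 14, "real voices reading synthetic text"),
    TestSet("X-English (generative TTS)", 16, "generative text-to-speech"),
    TestSet("{Arabic, Mandarin, Hindi, Spanish}-X", 60, "generative text-to-speech"),
    TestSet("X-English lower-resourced", 45, "concatenative text-to-speech"),
]
UNIQUE_PAIRS = 113  # total unique pairs stated in the abstract

# Sum of the per-set pair counts.
total_entries = sum(t.pairs for t in TEST_SETS)

# Entries beyond the unique total must repeat a pair found in another set.
repeated_entries = total_entries - UNIQUE_PAIRS

for t in TEST_SETS:
    print(f"{t.label}: {t.pairs} pairs, {t.speech}")
print(f"per-set total: {total_entries}, unique: {UNIQUE_PAIRS}, "
      f"repeated across sets: {repeated_entries}")
```

Because the per-set counts sum to more than the unique total, some language pairs must appear in more than one test set. The abstract does not say which ones.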
