CL · SD · AS · Aug 21, 2025

UniCoM: A Universal Code-Switching Speech Generator

arXiv:2508.15244v1 · 2 citations · h-index: 5 · EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the shortage of data for code-switching speech technology, which could enable more inclusive multilingual systems. The advance is incremental because it builds on existing methods for dataset generation.

The paper tackles the scarcity of datasets for code-switching speech by proposing UniCoM, a pipeline that generates high-quality code-switching samples without altering semantics, resulting in the CS-FLEURS corpus that performs comparably to existing datasets on objective and subjective metrics.

Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.
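
The abstract describes SWORDS only at a high level: selected words are replaced with their translations, taking part of speech into account. The sketch below illustrates that idea on text alone and is not the authors' implementation. The toy lexicon, the POS tags, the choice of nouns as the switchable class, and the `max_swaps` limit are all assumptions made for illustration. The actual pipeline produces speech, which this sketch does not attempt.

```python
# Minimal text-only sketch of POS-aware word substitution.
# Everything below is hypothetical: the lexicon, the tag set, and the
# selection rule are illustrative assumptions, not taken from the paper.

TOY_LEXICON = {"book": "libro", "house": "casa", "reads": "lee"}  # English -> Spanish
SWITCHABLE_POS = {"NOUN"}  # assumed: only nouns are eligible for switching


def code_switch(tagged_tokens, lexicon, switchable_pos, max_swaps=1):
    """Replace up to max_swaps eligible words with their translations.

    tagged_tokens: list of (word, pos) pairs for one sentence.
    A word is eligible if its POS is in switchable_pos and the lexicon
    has a translation for it. All other words are kept unchanged.
    """
    output = []
    swaps = 0
    for word, pos in tagged_tokens:
        translation = lexicon.get(word.lower())
        if pos in switchable_pos and translation is not None and swaps < max_swaps:
            output.append(translation)
            swaps += 1
        else:
            output.append(word)
    return output


sentence = [
    ("She", "PRON"), ("reads", "VERB"), ("a", "DET"),
    ("book", "NOUN"), ("at", "ADP"), ("home", "NOUN"),
]
mixed = code_switch(sentence, TOY_LEXICON, SWITCHABLE_POS)
print(" ".join(mixed))  # She reads a libro at home
```

Restricting substitution to one POS class keeps the rest of the sentence in the original language, so word order and function words stay intact. "reads" is left alone despite having a lexicon entry because verbs are not in the assumed switchable set, and "home" is left alone because the toy lexicon has no entry for it.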
