CL | AI | SD | AS | Jun 17, 2025

Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages

arXiv:2506.14177v1 | 3 citations | h-index: 3 | INTERSPEECH

Originality: Incremental advance

AI Analysis

The work offers a cost-effective approach to developing CS-ASR systems, benefiting research and industry in multilingual settings such as Singapore. The advance is incremental, since it builds on existing pretrained models.

The study addresses the challenge of building automatic speech recognition (ASR) systems for code-switching (CS) without real CS data. It proposes a phrase-level mixing method to generate synthetic CS data, which improved ASR performance on both monolingual and CS tests. Gains varied across language pairs, with BM-EN showing the highest improvement.

Code-switching (CS), common in multilingual settings, presents challenges for ASR because transcribed CS data is scarce and costly to obtain, a consequence of its linguistic complexity. This study investigates building CS-ASR systems using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We use monolingual data augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM. This finding offers a cost-effective approach for CS-ASR development, benefiting research and industry.
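To make the idea of phrase-level mixing concrete, here is a minimal text-side sketch. It is not the authors' implementation: the abstract does not specify how phrases are aligned, how switch points are chosen, or how the corresponding audio is produced. The function name `mix_phrases`, the `switch_prob` parameter, and the hand-aligned Malay-English phrase pairs are all hypothetical and used for illustration only.

```python
import random
from typing import List, Tuple


def mix_phrases(
    aligned: List[Tuple[str, str]],
    switch_prob: float = 0.5,
    seed: int = 0,
) -> str:
    """Build one synthetic code-switched sentence from aligned phrase pairs.

    Each item in `aligned` is (matrix_language_phrase, embedded_language_phrase).
    For every pair, the embedded-language phrase is chosen with probability
    `switch_prob`; otherwise the matrix-language phrase is kept.
    This sampling rule is an assumption, not the paper's method.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    chosen = []
    for matrix_phrase, embedded_phrase in aligned:
        if rng.random() < switch_prob:
            chosen.append(embedded_phrase)
        else:
            chosen.append(matrix_phrase)
    return " ".join(chosen)


if __name__ == "__main__":
    # Hypothetical hand-aligned Malay/English phrase pairs.
    pairs = [
        ("saya mahu", "I want"),
        ("pergi ke", "to go to"),
        ("pasar", "the market"),
    ]
    for s in range(3):
        print(mix_phrases(pairs, switch_prob=0.5, seed=s))
```

Setting `switch_prob` to 0.0 or 1.0 reproduces the two monolingual sentences, so the same routine can emit both monolingual and mixed text. A real pipeline would also need matching speech for each mixed sentence, which this sketch does not cover.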

