CL | Oct 27, 2025

Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque, Md. Rezuwan Hassan, Kanij Fatema, Tanmoy Shome, Ruwad Naswan, Md. Foriduzzaman Zihad, Mohaymen Ul Anam, Nazia Tasnim, Hasan Mahmud, Md Kamrul Hasan

arXiv:2510.23252v2 | 3 citations | h-index: 16 | IJCNLP-AACL

Originality: Synthesis-oriented

AI Analysis

This work addresses poor ASR performance on regional dialects of low-resource languages. The contribution is incremental: it highlights the limitations of existing methods rather than introducing a new solution.

The study investigated whether ASR foundation models can handle regional dialects in low-resource languages, finding that they struggle significantly with Bengali dialects, achieving only 40% accuracy in zero-shot settings, and that dialect-specific training improves performance.

Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
