Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR
This work addresses ASR bias against L2 learners, which matters for accessibility in education, though the contribution is incremental because it builds on existing models such as Whisper.
The paper tackles the underperformance of automatic speech recognition (ASR) for L2 learners by proposing proficiency-aware multitask learning and targeted data augmentation, which reduce word error rate (WER) by up to 29.4% (relative) and narrow the gaps between proficiency levels.
General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting use in education and accessibility. Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. These approaches reduce WER by up to 29.4 percent (relative) and insertion/deletion errors by as much as 58.6 percent (relative). Crucially, despite the severe imbalance of the dataset reflecting real-world distributions, both strategies consistently narrow proficiency gaps, advancing equitable ASR for L2 learners.
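The two strategies and the relative-reduction metric in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation and rests on several assumptions: the set of CEFR levels treated as low proficiency (`LOW_PROFICIENCY`), the mask widths, the zero fill value, the weighted-sum form of the joint loss, and its weight `cls_weight` are all hypothetical choices made for illustration. The spectrogram is represented as a plain list of frequency rows rather than a tensor.

```python
import random

# Assumption: which CEFR levels count as "low proficiency" is not stated in
# the abstract; this set is a hypothetical choice for illustration.
LOW_PROFICIENCY = {"A2", "B1"}


def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of frequency rows, each a list of per-frame values.
    The widths and the 0.0 fill value are illustrative assumptions.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: zero a band of consecutive frequency rows.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero a span of consecutive frames in every row.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out


def augment_if_low_proficiency(spec, cefr_level, rng=None):
    """Targeted augmentation: mask only utterances from low-proficiency speakers."""
    if cefr_level in LOW_PROFICIENCY:
        return mask_spectrogram(spec, rng=rng)
    return spec


def joint_loss(asr_loss, cls_loss, cls_weight=0.1):
    """Assumed multitask objective: ASR loss plus a weighted classification loss.

    The weighted-sum form and the default weight are hypothetical.
    """
    return asr_loss + cls_weight * cls_loss


def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of WER between two systems."""
    return 100.0 * (baseline - improved) / baseline
```

In this sketch, calling `augment_if_low_proficiency(spec, "C1")` returns the input unchanged, while the same call with `"A2"` returns a masked copy, so augmentation adds variety only where the corpus is sparse. `relative_reduction` makes explicit that the reported figures are relative to a baseline error rate rather than absolute percentage-point drops.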