Zero-Shot Context-Aware ASR for Diverse Arabic Varieties
This work aims to improve ASR accuracy on Arabic dialects and accents, which matters for users in multilingual and informal settings. The contribution is incremental: it builds on existing models through lightweight adaptations.
The paper tackles zero-shot automatic speech recognition (ASR) for diverse Arabic varieties, where error rates increase on dialectal and accented speech. It proposes context-aware decoding methods that condition inference on external side information without parameter updates, yielding average relative WER reductions of 22.29% on Modern Standard Arabic (MSA), 20.54% on accented MSA, and 9.15% on dialectal Arabic.
Zero-shot ASR for Arabic remains challenging: while multilingual models perform well on Modern Standard Arabic (MSA), error rates rise sharply on dialectal and accented speech due to linguistic mismatch and scarce labeled data. We study context-aware decoding as a lightweight test-time adaptation paradigm that conditions inference on external side information without parameter updates. For promptable encoder-decoder ASR (e.g., Whisper), we incorporate context through (i) decoder prompting with first-pass hypotheses and (ii) encoder/decoder prefixing with retrieved speech-text exemplars, complemented by simple prompt reordering and optional speaker-matched synthetic exemplars to improve robustness in informal and multi-speaker settings. To extend contextual adaptation beyond promptable architectures, we introduce proxy-guided n-best selection for CTC ASR: given one or more external proxy hypotheses, we select from a model's n-best list by minimizing text-level distance to the proxies, enabling contextual inference without direct prompting. Across ten Arabic conditions spanning MSA, accented MSA, and multiple dialects, context-aware decoding yields average relative WER reductions of 22.29% on MSA, 20.54% on accented MSA, and 9.15% on dialectal Arabic. For CTC models, proxy-guided selection reduces WER by 15.6% relative on MSA and recovers a substantial fraction of oracle n-best gains, demonstrating that context-aware inference generalizes beyond encoder-decoder ASR.
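The proxy-guided n-best selection described above can be illustrated with a minimal sketch. The abstract specifies only a "text-level distance" to the proxies, so the word-level Levenshtein distance, the summation across proxies, and the tie-breaking by n-best rank used here are assumptions, as are the function names `word_edit_distance` and `select_from_nbest`. This is not the authors' implementation.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over word tokens (assumed distance metric)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def select_from_nbest(nbest: List[str], proxies: List[str]) -> str:
    """Pick the n-best candidate with the smallest total distance to the proxies.

    `nbest` is assumed to be ordered by model score (best first), so ties
    fall back to the model's own ranking. With no proxies, the 1-best
    hypothesis is returned unchanged.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    if not proxies:
        return nbest[0]
    proxy_tokens = [p.split() for p in proxies]

    def total_distance(candidate: str) -> int:
        cand_tokens = candidate.split()
        return sum(word_edit_distance(p, cand_tokens) for p in proxy_tokens)

    # min() returns the first minimal element, which preserves n-best order on ties.
    return min(nbest, key=total_distance)


if __name__ == "__main__":
    # Hypothetical n-best list from a CTC model and one external proxy hypothesis.
    nbest = ["the cat sat on a mat",
             "the cat sat on the mat",
             "a cat sat on the mat"]
    proxies = ["the cat sat on the mat today"]
    print(select_from_nbest(nbest, proxies))
```

Because selection is restricted to the model's own n-best list, the output is always a hypothesis the CTC model itself produced; the proxies only influence which one is chosen, which is why the achievable gain is bounded by the oracle n-best result mentioned above.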