Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
This work addresses automated spoken language assessment for L2 learners, offering an incremental improvement by using the hidden representations of an existing ASR foundation model (Whisper) rather than only its transcriptions.
The paper tackles L2 English oral proficiency assessment by probing Whisper's hidden representations for acoustic and linguistic features. With a lightweight classifier, it achieves strong performance on the GEPT picture-description dataset and outperforms existing baselines.
In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further and probes its latent capabilities by extracting acoustic and linguistic features from its hidden representations. With only a lightweight classifier trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.
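To make the idea of a lightweight classifier on top of frozen hidden representations concrete, the following is a minimal sketch and not the method described in the paper. It assumes per-layer frame vectors have already been extracted from the speech model and are passed in as plain nested lists. The mean pooling over time, the softmax-weighted mixing of layers, the linear head, and all function names (`mean_pool`, `combine_layers`, `predict_level`) are illustrative assumptions; the abstract does not specify these details.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors (T x D) into one D-dim vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(hidden_states, layer_scores):
    """Mean-pool each layer over time, then mix layers with softmax weights.

    hidden_states: list of layers, each a list of frame vectors (assumed to
    come from a frozen speech model; here they are just nested lists).
    layer_scores: one scalar per layer (hypothetical learnable parameters).
    """
    weights = softmax(layer_scores)
    pooled = [mean_pool(layer) for layer in hidden_states]
    dim = len(pooled[0])
    return [sum(w * p[d] for w, p in zip(weights, pooled)) for d in range(dim)]


def predict_level(features, class_weights, class_bias):
    """Linear head: return the index of the highest-scoring proficiency level."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(class_weights, class_bias)
    ]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy input: 2 layers, 2 frames each, 2-dim vectors (stand-ins for real features).
toy_states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
toy_features = combine_layers(toy_states, [0.0, 0.0])
toy_level = predict_level(toy_features, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this sketch only `layer_scores`, `class_weights`, and `class_bias` would be trained, while the speech model that produces the hidden states stays frozen. That is one way to read "lightweight", offered here as an assumption rather than a description of the authors' setup.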