Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning
This work addresses the need for reliable automated assessment in computer-assisted language learning for L2 English speakers, offering a grader that is more accurate than the previous state-of-the-art cascaded system and compact enough to deploy.
The paper tackles spoken language assessment with a multimodal foundation model that processes an entire response session to predict oral proficiency. It outperforms the previous state-of-the-art system on the Speak & Improve benchmark and shows robust cross-part generalization.
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. The growing population of L2 English speakers has intensified the demand for reliable SLA, a critical component of Computer-Assisted Language Learning (CALL). Existing efforts often rely on cascaded pipelines, which are prone to error propagation, or on end-to-end models that operate on a short audio window and can therefore miss discourse-level evidence. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass. Our approach couples multi-target learning with a speech prior derived from a frozen Whisper ASR model for acoustic-aware calibration, enabling holistic and trait-level SLA objectives to be learned jointly without resorting to handcrafted features. By coherently processing the entire response session of an L2 speaker, the model excels at predicting holistic oral proficiency. Experiments on the Speak & Improve benchmark demonstrate that the proposed approach outperforms the previous state-of-the-art cascaded system and exhibits robust cross-part generalization, yielding a compact, deployable grader tailored for CALL applications.
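
To make the idea of jointly learning holistic and trait-level objectives concrete, the following is a minimal sketch of one possible multi-target loss. It is an illustration only, not the objective used in the paper: the squared-error form, the weights `holistic_weight` and `trait_weight`, and the trait names `fluency` and `vocabulary` are all assumptions introduced here.

```python
from typing import Dict


def multi_target_loss(
    pred: Dict[str, float],
    gold: Dict[str, float],
    holistic_weight: float = 1.0,  # assumed weight, not taken from the paper
    trait_weight: float = 0.5,     # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of a holistic squared error and the mean trait squared error."""
    # Error on the single session-level (holistic) score.
    holistic_term = (pred["holistic"] - gold["holistic"]) ** 2

    # Every other key is treated as a trait-level target (hypothetical names).
    trait_names = [name for name in gold if name != "holistic"]
    trait_sum = sum((pred[name] - gold[name]) ** 2 for name in trait_names)
    trait_term = trait_sum / max(len(trait_names), 1)

    return holistic_weight * holistic_term + trait_weight * trait_term


# Example session with one holistic score and two hypothetical trait scores.
pred = {"holistic": 4.0, "fluency": 3.0, "vocabulary": 5.0}
gold = {"holistic": 3.0, "fluency": 3.0, "vocabulary": 4.0}
print(multi_target_loss(pred, gold))  # → 1.25
```

Averaging the trait errors before weighting keeps the holistic term from being swamped as more traits are added. In a real system the predictions would come from the model's output heads rather than a hand-written dictionary.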