CL, AS | Jul 23, 2025

One Whisper to Grade Them All

Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo

arXiv:2507.17918v1 | 2.7 | h-index: 8 | Slate

Originality: Incremental advance

AI Analysis

This work targets practical, large-scale Computer-Assisted Language Learning systems by reducing inference time and removing the need for transcription. The advance is incremental, since it builds on existing Whisper models.

The paper tackles holistic Automatic Speaking Assessment for multi-part second-language tests with an efficient end-to-end system: a single Whisper-small encoder processes all spoken responses, and a lightweight aggregator combines them to predict the final score. The system achieves a Root Mean Squared Error of 0.384, outperforming a text-based baseline (0.44) while using at most 168M parameters.
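The data flow described above (one shared encoder for every response, then a lightweight aggregator, then a single score) can be sketched in plain Python. This is a toy illustration under loud assumptions, not the authors' implementation: `shared_encoder` is a stand-in `tanh` transform rather than Whisper-small, mean pooling stands in for the aggregator (whose actual design is not given here), and all function names, weights, and feature values are hypothetical.

```python
import math
from statistics import fmean


def shared_encoder(frames):
    """Toy stand-in for the shared speech encoder (NOT Whisper-small).

    Applies the same element-wise transform to every frame of a response.
    """
    return [[math.tanh(x) for x in frame] for frame in frames]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    return [fmean(column) for column in zip(*vectors)]


def predict_score(responses, head_weights, head_bias):
    """Encode each response with the SAME encoder, aggregate, then regress.

    Mean pooling is an assumed placeholder for the lightweight aggregator.
    """
    # One embedding per test part, all produced by the single shared encoder.
    part_embeddings = [mean_pool(shared_encoder(r)) for r in responses]
    # Combine the per-part embeddings into one embedding for the whole test.
    test_embedding = mean_pool(part_embeddings)
    # Linear head mapping the combined embedding to one holistic score.
    return sum(w * x for w, x in zip(head_weights, test_embedding)) + head_bias


# Four hypothetical responses, each a variable-length list of 2-dim frames.
responses = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.0, 0.5]],
    [[0.2, 0.2], [0.2, 0.2], [0.2, 0.2]],
    [[0.4, 0.1], [0.0, 0.3]],
]
score = predict_score(responses, head_weights=[1.0, -0.5], head_bias=3.0)
```

The point of the sketch is structural: no transcript is produced at any step, and the same encoder function serves all four parts, so there are no per-part models to store or run.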

We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
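For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between predicted and reference scores, so lower is better. The sketch below is a minimal pure-Python version; the function name `rmse` and the example scores are hypothetical, and only the two RMSE values 0.384 and 0.44 come from the abstract.

```python
import math


def rmse(predicted, reference):
    """Root Mean Squared Error between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical holistic scores for four speakers (illustration only).
reference = [3.0, 4.5, 2.5, 5.0]
predicted = [3.5, 4.0, 3.0, 4.5]
example_rmse = rmse(predicted, reference)  # every error is 0.5, so RMSE is 0.5

# Relative RMSE reduction implied by the reported numbers:
# (0.44 - 0.384) / 0.44, about 12.7% lower than the text-based baseline.
relative_reduction = (0.44 - 0.384) / 0.44
```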
