An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment
This work aims to make automated speaking assessment for non-native speakers more accurate, offering an incremental improvement over prior methods.
The paper addresses limitations of existing self-supervised learning (SSL) approaches to automated speaking assessment, which ignore the ordinal structure of proficiency labels and the non-uniform intervals between them. The proposed method, which combines SSL with handcrafted features and a multi-margin ordinal loss, outperforms strong baselines on the TEEMI corpus and generalizes well to unseen prompts.
A recent line of research on automated speaking assessment (ASA) has benefited from self-supervised learning (SSL) representations, which capture rich acoustic and linguistic patterns in non-native speech without relying on manually curated features. However, speech-based SSL models capture acoustic traits but overlook linguistic content, while text-based SSL models rely on ASR output and fail to encode prosodic nuances. Moreover, most prior work treats proficiency levels as nominal classes, ignoring both their ordinal structure and the non-uniform intervals between proficiency labels. To address these limitations, we propose an effective ASA approach that combines SSL representations with handcrafted indicator features via a novel modeling paradigm. We further introduce a multi-margin ordinal loss that jointly models the score ordinality and the non-uniform intervals of proficiency labels. Extensive experiments on the TEEMI corpus show that our method consistently outperforms strong baselines and generalizes well to unseen prompts.
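The abstract does not state how the multi-margin ordinal loss is formulated. As a rough illustration of the general idea only, the sketch below uses an all-threshold hinge loss in which each boundary between adjacent proficiency levels has its own position and its own margin. This is an assumed stand-in, not the formulation used in the paper. The names `multi_margin_ordinal_loss` and `predict_level` and the threshold and margin values are hypothetical.

```python
from typing import Sequence


def multi_margin_ordinal_loss(
    score: float,
    label: int,
    thresholds: Sequence[float],
    margins: Sequence[float],
) -> float:
    """All-threshold hinge loss with one margin per level boundary.

    Assumed setup (not taken from the paper): the model emits a scalar
    `score`; `label` is an integer level in 0..K-1; `thresholds` holds the
    K-1 increasing boundaries between adjacent levels; `margins` holds the
    required clearance on each boundary. Unevenly spaced thresholds and
    unequal margins are what encode non-uniform intervals here.
    """
    if len(thresholds) != len(margins):
        raise ValueError("need exactly one margin per threshold")
    loss = 0.0
    for k, (boundary, margin) in enumerate(zip(thresholds, margins)):
        # The score should lie above boundary k when the true level exceeds k,
        # and below it otherwise.
        sign = 1.0 if label > k else -1.0
        loss += max(0.0, margin - sign * (score - boundary))
    return loss


def predict_level(score: float, thresholds: Sequence[float]) -> int:
    """Predicted level = number of boundaries the score exceeds."""
    return sum(score > boundary for boundary in thresholds)


# Hypothetical values: four levels, with a wider gap before the top level.
thresholds = [1.0, 2.0, 4.0]
margins = [0.5, 0.5, 1.0]
losses = [multi_margin_ordinal_loss(3.0, y, thresholds, margins) for y in range(4)]
```

With these illustrative values, a score of 3.0 incurs zero loss for level 2 and a larger penalty the further the true level lies from it. A nominal cross-entropy objective has no such ordering built in, which is the contrast the abstract draws.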