CL, SD, AS | Sep 3, 2025

Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

arXiv:2509.03256v14.91 citations | h-index: 2 | MLSP

Originality: Synthesis-oriented

AI Analysis

This work addresses pronunciation assessment for children learning Norwegian as a second language. Its contribution is incremental: it compares and optimizes existing model types for a specific challenge.

The paper tackles automatic word-level pronunciation assessment for children learning Norwegian as a second language by analyzing three end-to-end models. The GOP-CTC-based model performs best, substantially surpassing the challenge baselines and reaching top leaderboard scores.

This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
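The two metrics named in the abstract have standard definitions, which the minimal Python sketch below spells out: unweighted average recall (UAR) is the mean of per-class recalls, and mean absolute error (MAE) is the average distance between predicted and true scores. The abstract does not give the form of the weighted ordinal cross-entropy loss, so `weighted_ordinal_cross_entropy` is only a hypothetical illustration of the general idea (class weighting plus a penalty that grows with ordinal distance), not the authors' formulation. All function names, the `alpha` parameter, and the example scores are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted integer scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_ordinal_cross_entropy(probs, target, class_weights, alpha=1.0):
    """Hypothetical loss, NOT taken from the paper.

    probs: predicted class probabilities (0-based class index, sums to 1).
    target: 0-based index of the true class.
    class_weights: per-class weights, e.g. inverse class frequency.
    alpha: assumed trade-off between cross-entropy and the ordinal penalty.
    """
    cross_entropy = -math.log(probs[target])
    # Expected ordinal distance: mass placed far from the target costs more.
    expected_distance = sum(p * abs(k - target) for k, p in enumerate(probs))
    return class_weights[target] * (cross_entropy + alpha * expected_distance)


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    y_true = [1, 1, 1, 2, 3]
    y_pred = [1, 1, 2, 2, 1]
    print("UAR:", unweighted_average_recall(y_true, y_pred))
    print("MAE:", mean_absolute_error(y_true, y_pred))
    print("loss:", weighted_ordinal_cross_entropy([0.7, 0.2, 0.1], 0, [1.0, 1.0, 1.0]))
```

In this sketch the class weights pull the loss toward balanced per-class recall (the quantity UAR rewards), while the distance term penalizes predictions far from the true score (the quantity MAE measures).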

