CL · SD · AS · Sep 3, 2025

Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

arXiv:2509.03256v1 · 1 citation · h-index: 2 · MLSP
Originality: Synthesis-oriented
AI Analysis

This work addresses pronunciation assessment for children learning Norwegian as a second language. Its contribution is incremental: it compares and optimizes existing model types for a specific challenge.

This paper tackles automatic word-level pronunciation assessment for children learning Norwegian as a second language by analyzing three end-to-end models. The GOP-CTC-based model achieved the highest performance, substantially surpassing the challenge baselines and attaining top leaderboard scores.

This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
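The two metrics named in the abstract, unweighted average recall (UAR) and mean absolute error (MAE), can be illustrated with a minimal sketch. The function names, the integer score scale, and the sample labels below are assumptions made for illustration; they are not taken from the paper or the challenge data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    total = defaultdict(int)    # number of reference items per class
    correct = defaultdict(int)  # number of correctly predicted items per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and reference scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical pronunciation scores (illustrative values only).
y_true = [1, 1, 2, 3, 5, 5, 5, 5]
y_pred = [1, 2, 2, 3, 5, 5, 4, 3]

uar = unweighted_average_recall(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"UAR = {uar:.2f}, MAE = {mae:.2f}")
```

UAR rewards getting every score class right, including rare ones, while MAE penalizes predictions in proportion to how far they land from the reference score. A loss that respects the ordering of the score classes is one way to target both at once.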
