CL AS Jun 5, 2023

End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

Microsoft
arXiv:2306.02682v1, 9 citations, h-index: 27
Originality: Incremental advance
AI Analysis

This work addresses misalignment issues in computer-aided pronunciation training systems and offers an incremental improvement over existing methods.

The paper tackles word-level pronunciation assessment by proposing an end-to-end method that needs no alignment components and outperforms previous methods on the SpeechOcean762 dataset.

Pronunciation assessment is a major challenge in computer-aided pronunciation training systems, especially at the word (phoneme) level. To obtain word (phoneme)-level scores, current methods usually rely on alignment components to extract the acoustic features of each word (phoneme), which caps assessment performance at the accuracy of the alignments. To address this problem, we propose a simple yet effective method, Masked pre-training for Pronunciation Assessment (MPA). Specifically, by incorporating a mask-predict strategy, MPA supports end-to-end training without any alignment components and largely resolves misalignment issues during prediction. Furthermore, we design two evaluation strategies that enable the model to conduct assessments in both unsupervised and supervised settings. Experimental results on the SpeechOcean762 dataset demonstrate that MPA achieves better performance than previous methods without any explicit alignment. MPA still has some limitations, such as requiring more inference time and a reference text, which we expect to address in future work.

