SD CL LG AS

May 31, 2023

Zero-Shot Automatic Pronunciation Assessment

arXiv:2305.19563v1

8.49 citations

Originality: Incremental advance

AI Analysis

The paper addresses pronunciation assessment for language learners with a zero-shot approach. The advance is incremental because it builds on pre-trained models and existing benchmarks.

The paper tackles automatic pronunciation assessment without annotated data. It proposes a zero-shot method that combines HuBERT, masking, and token recovery. Measured by Pearson Correlation Coefficient on speechocean762, the method performs comparably to supervised regression baselines.
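For readers unfamiliar with the metric, the following is a minimal sketch of how a Pearson Correlation Coefficient between human scores and system scores can be computed. The function name `pearson` and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores, for illustration only (not results from the paper).
human_scores = [2.0, 4.0, 6.0, 8.0, 10.0]
system_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
print(pearson(human_scores, system_scores))
```

A value near 1 means the system ranks and spaces utterances much like the human raters do, regardless of the scale on which the system scores are expressed.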

Automatic Pronunciation Assessment (APA) is vital for computer-assisted language learning. Prior methods rely on annotated speech-text data to train Automatic Speech Recognition (ASR) models or speech-score data to train regression models. In this work, we propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT. Our method involves encoding the speech input and corrupting it via a masking module. We then employ the Transformer encoder and apply k-means clustering to obtain token sequences. Finally, a scoring module is designed to measure the number of wrongly recovered tokens. Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines and outperforms non-regression baselines in terms of Pearson Correlation Coefficient (PCC). Additionally, we analyze how masking strategies affect the performance of APA.
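The scoring idea in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: plain integer lists stand in for the k-means token sequences that would come from HuBERT, the span-masking helper is a generic strategy chosen for illustration, and mapping the mismatch count to a score in [0, 1] is an assumption (the abstract only says the scoring module measures the number of wrongly recovered tokens). All function names are hypothetical.

```python
import random


def mask_spans(num_frames, mask_prob, span, rng):
    """Choose frame indices to corrupt (assumed strategy): each frame
    starts a masked span of length `span` with probability `mask_prob`."""
    masked = set()
    for start in range(num_frames):
        if rng.random() < mask_prob:
            masked.update(range(start, min(start + span, num_frames)))
    return sorted(masked)


def mismatch_count(reference_tokens, recovered_tokens, masked_positions):
    """Count masked positions where the recovered token differs from the
    token obtained from the uncorrupted input."""
    return sum(
        1 for i in masked_positions
        if reference_tokens[i] != recovered_tokens[i]
    )


def pronunciation_score(reference_tokens, recovered_tokens, masked_positions):
    """Assumed normalization: 1.0 if every masked token is recovered,
    0.0 if none is."""
    if not masked_positions:
        return 1.0
    wrong = mismatch_count(reference_tokens, recovered_tokens, masked_positions)
    return 1.0 - wrong / len(masked_positions)


# Toy token sequences (hypothetical cluster ids, not real model output).
reference = [5, 5, 12, 12, 7, 7, 3, 3]   # tokens from the clean input
recovered = [5, 5, 12, 9, 7, 7, 3, 1]    # tokens predicted from the masked input
masked = [2, 3, 6, 7]                    # positions that were corrupted

print(mismatch_count(reference, recovered, masked))       # 2
print(pronunciation_score(reference, recovered, masked))  # 0.5

# Sampling mask positions with the assumed span strategy.
positions = mask_spans(num_frames=8, mask_prob=0.3, span=2, rng=random.Random(0))
```

The intuition is that well-pronounced speech is easier for a pre-trained model to reconstruct, so fewer wrongly recovered tokens suggest better pronunciation. Changing `mask_prob` and `span` in the sketch is one simple way to experiment with masking strategies.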
