CL, AS
May 19, 2023

Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring

arXiv:2305.11438v1
4 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of automated fluency assessment for non-native speakers, representing an incremental advance in applying self-supervised learning to a domain-specific task.

The paper tackles the shortage of labeled data for deep learning-based scoring of non-native speech fluency. It introduces a phonetic- and prosody-aware self-supervised learning approach that outperforms baseline systems, as measured by Pearson correlation coefficients on Speechocean762 and the authors' own non-native datasets.
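For readers unfamiliar with the metric, the Pearson correlation coefficient between predicted and human fluency scores can be computed as in the minimal sketch below. The function name and the example scores are illustrative assumptions and are not taken from the paper.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two score pairs are required")

    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Sums of centred products; the 1/n factors cancel in the final ratio.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_p * var_r)


# Made-up scores, for illustration only.
model_scores = [6.1, 7.4, 3.2, 8.8, 5.0]
human_scores = [6.0, 8.0, 3.0, 9.0, 5.0]
pcc = pearson_correlation(model_scores, human_scores)
```

A value close to 1 means the model ranks and spaces speakers much as the human raters do. The metric is insensitive to a constant offset or positive rescaling of the predicted scores.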

Speech fluency and disfluency can be evaluated by analyzing a range of phonetic and prosodic features. Deep neural networks are commonly trained to map fluency-related features to human scores. However, the effectiveness of deep learning-based models is constrained by the limited amount of labeled training samples. To address this, we introduce a self-supervised learning (SSL) approach to fluency scoring that is both phonetic- and prosody-aware. Specifically, we first pre-train the model with a reconstruction loss, jointly masking phones and their durations on a large amount of unlabeled speech and text prompts. We then fine-tune the pre-trained model using human-annotated scoring data. Our experimental results on Speechocean762 and our own non-native datasets show that the proposed method outperforms the baseline systems in terms of Pearson correlation coefficient (PCC). We also conduct an ablation study to better understand the contribution of phonetic and prosodic factors during the pre-training stage.
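The joint masking step described in the abstract can be pictured with the following sketch. It is not the authors' implementation: the mask token, the duration placeholder, the 15% masking probability, and the function names are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import random

MASK_PHONE = "<mask>"   # hypothetical mask token (assumption)
MASK_DURATION = 0.0     # hypothetical placeholder for a hidden duration (assumption)


def mask_jointly(phones, durations, mask_prob=0.15, seed=None):
    """Hide a phone and its duration together at randomly chosen positions.

    Returns the masked phone list, the masked duration list, and the
    indices that were masked. The default mask_prob is an assumed value.
    """
    if len(phones) != len(durations):
        raise ValueError("phones and durations must align one-to-one")

    rng = random.Random(seed)
    masked_phones = list(phones)
    masked_durations = list(durations)
    masked_positions = []

    for i in range(len(phones)):
        if rng.random() < mask_prob:
            # Both views of the same position are hidden at once, so a model
            # cannot recover the phone from its duration or vice versa.
            masked_phones[i] = MASK_PHONE
            masked_durations[i] = MASK_DURATION
            masked_positions.append(i)

    return masked_phones, masked_durations, masked_positions


def reconstruction_targets(phones, durations, masked_positions):
    """Original (phone, duration) pairs a reconstruction loss would compare against."""
    return [(phones[i], durations[i]) for i in masked_positions]


# Illustrative input: phone labels with durations in seconds (made-up values).
phones = ["HH", "AH", "L", "OW"]
durations = [0.08, 0.06, 0.07, 0.15]
masked_phones, masked_durations, positions = mask_jointly(phones, durations, seed=0)
targets = reconstruction_targets(phones, durations, positions)
```

In a pre-training setup of this kind, a model would receive the masked sequences and be trained to predict `targets`. Fine-tuning on human-annotated scores would then start from the pre-trained weights.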

