CL AI LG AS | Feb 10, 2025

Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment

Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen

CMU

arXiv:2502.07029v223.019 citations | h-index: 18 | Has Code | NAACL

Originality: Incremental advance

AI Analysis

This work addresses the challenge of distinguishing atypical from typical pronunciations in speech assessment, with potential applications in healthcare and language learning, though it appears incremental in method.

The paper addresses allophonic variation in atypical pronunciation assessment by proposing MixGoP, a method that models each phoneme with a Gaussian mixture over self-supervised speech model features. MixGoP achieves state-of-the-art performance on four out of five datasets, which include dysarthric and non-native speech.

Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
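The idea of modeling one phoneme with several subclusters can be illustrated with a small sketch. This is not the authors' implementation: it assumes the pronunciation score is simply the log-likelihood of a feature vector under the target phoneme's Gaussian mixture, and it uses hand-set, hypothetical 2-D parameters in place of a mixture fitted to real S3M features. The function names `log_gauss_diag` and `gmm_log_likelihood` are illustrative.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture (log-sum-exp for stability)."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Hypothetical mixture for one phoneme: two subclusters standing in for
# two allophones. Real features would be high-dimensional S3M vectors.
weights = [0.5, 0.5]
means = [[-2.0, 0.0], [2.0, 0.0]]
variances = [[0.5, 0.5], [0.5, 0.5]]

typical = [2.1, -0.1]   # close to one subcluster
atypical = [0.0, 4.0]   # far from both subclusters

score_typical = gmm_log_likelihood(typical, weights, means, variances)
score_atypical = gmm_log_likelihood(atypical, weights, means, variances)

# A higher score means the segment looks more like typical realizations.
print(score_typical > score_atypical)
```

In this toy setup, a realization near either subcluster scores high, so distinct allophones of the same phoneme are both treated as typical, while a point far from every subcluster receives a low score.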

