SD · AI · Jun 8

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

Diane Myung-kyung Woodbridge, Jee Hyun Suh

arXiv:2606.10213v1


AI Analysis

The paper addresses the underdeveloped area of automated assessment for Korean pediatric speech disorders, but its results are modest and its dataset is small.

The paper tackles automated pronunciation evaluation for Korean toddler speech, achieving balanced accuracies of 0.720 for consonants and 0.845 for vowels using an ensemble of self-supervised models.
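For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, which keeps a majority class from dominating the score. The sketch below shows the standard binary definition only; the function name and the toy labels are illustrative assumptions, not the authors' code or data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = correct, 0 = incorrect)."""
    recalls = []
    for cls in (0, 1):
        # Positions of the items whose true label is this class
        idx = [i for i, t in enumerate(y_true) if t == cls]
        if not idx:
            raise ValueError(f"no examples of class {cls}")
        # Recall for this class: fraction of its items predicted correctly
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels invented for illustration (not taken from the paper's corpus)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy: {score:.3f}")
```

With these toy labels, plain accuracy would be 4/6, while balanced accuracy averages a recall of 3/4 on the positive class with 1/2 on the negative class.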

Speech sound disorders account for approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech, combining neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2-5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models, finding that NeMo SortFormer achieves 88.69% speaker count accuracy and a 33.04% diarization error rate (DER) owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between toddler speech and the speech of young female caregivers exhibiting aegyo. For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble that routes consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 for consonants and 0.845 for vowels, with a mean of 0.782.
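As context for the DER figure, diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below applies that conventional definition to made-up durations; the abstract does not state the scoring setup (for example collar size or overlap handling), so the function name and the numbers are assumptions for illustration only.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """Conventional DER: summed error durations over total reference speech time.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


# Hypothetical durations in seconds, chosen only to illustrate the arithmetic
toy_der = diarization_error_rate(
    missed=12.0,           # reference speech the system labeled as silence
    false_alarm=6.0,       # non-speech the system labeled as speech
    confusion=15.0,        # speech attributed to the wrong speaker
    total_reference=100.0  # total speech time in the reference annotation
)
print(f"DER: {toy_der:.2%}")
```

Because false alarms count toward the numerator but not the denominator, DER can exceed 100% on difficult recordings.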

