SD · CL · AS · Nov 2, 2022

SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

arXiv:2211.00923v3 · 5 citations · h-index: 37
Originality: Incremental advance
AI Analysis

The paper addresses data scarcity for mispronunciation detection models in second language learning, a strong advance within that specific domain.

The paper tackles the lack of labeled second language (L2) speech data for mispronunciation detection by introducing SpeechBlender, a fine-grained data augmentation pipeline that generates mispronunciation errors through targeted masking and linear interpolation of raw speech signals. The method achieves state-of-the-art phoneme-level results on Speechocean762, with a 2.0% gain in Pearson Correlation Coefficient (PCC) over the previous state of the art and a 5.0% improvement over the authors' baseline. It also yields a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.

The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. SpeechBlender uses a variety of masks to target different regions of phonetic units, and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. Our proposed technique achieves state-of-the-art results on Speechocean762 for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state of the art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observed a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
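The idea of masking a region of a phonetic unit and linearly interpolating two raw signals can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `make_mask` and `blend`, the mask shapes ("hard" for a Cut/Paste-style replacement, "smooth" for a Hann-windowed blend), the `region` parameter, and the assumption that both segments are already aligned to the same length are all hypothetical choices made for illustration.

```python
import numpy as np


def make_mask(n, kind="smooth", region=(0.25, 0.75)):
    """Per-sample weights in [0, 1] marking which part of a segment to alter.

    `kind` and `region` are illustrative assumptions, not the paper's settings.
    """
    start, stop = int(region[0] * n), int(region[1] * n)
    mask = np.zeros(n)
    if kind == "hard":
        # Binary mask: abrupt replacement, similar in spirit to Cut/Paste.
        mask[start:stop] = 1.0
    elif kind == "smooth":
        # Hann window: weights rise from 0 and fall back to 0 at the region
        # edges, so the blended signal has no discontinuity there.
        mask[start:stop] = np.hanning(stop - start)
    else:
        raise ValueError(f"unknown mask kind: {kind}")
    return mask


def blend(good, donor, mask, lam=0.5):
    """Linearly interpolate two equal-length raw signals.

    The per-sample weight is lam * mask: where it is 0 the good signal is
    kept unchanged, where it is 1 the donor fully replaces it.
    """
    good = np.asarray(good, dtype=float)
    donor = np.asarray(donor, dtype=float)
    if not (good.shape == donor.shape == mask.shape):
        raise ValueError("good, donor and mask must have the same shape")
    w = lam * mask
    return (1.0 - w) * good + w * donor


# Toy usage: two synthetic "phoneme" segments at a 16 kHz sample rate.
t = np.arange(1600) / 16000.0
good_segment = np.sin(2 * np.pi * 220.0 * t)
donor_segment = np.sin(2 * np.pi * 330.0 * t)
augmented = blend(good_segment, donor_segment,
                  make_mask(len(t), "smooth"), lam=0.6)
```

With the "hard" mask and `lam=1.0` the sketch reduces to a plain cut-and-paste of the donor region, while the "smooth" mask tapers the donor in and out, which is the kind of smoother transition the abstract contrasts with Cut/Paste.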

