Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis
This work addresses mispronunciation detection for Mandarin learners, but the contribution is incremental: it builds on existing ASR methods and adds targeted enhancements.
The paper tackles mispronunciation detection and diagnosis in Mandarin Chinese by introducing a stateless RNN-T model that combines HuBERT features with pitch embedding. It reports a 3% improvement in Phone Error Rate and a 7% increase in False Acceptance Rate over the state-of-the-art baseline in non-native scenarios.
Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage; 2) the scarcity of Mandarin MDD datasets limits model training. In this paper, we introduce a stateless RNN-T model for Mandarin MDD, utilizing HuBERT features with pitch embedding through a Pitch Fusion Block. Our model, trained solely on native speaker data, shows a 3% improvement in Phone Error Rate and a 7% increase in False Acceptance Rate over the state-of-the-art baseline in non-native scenarios.
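The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes Phone Error Rate is the Levenshtein distance between reference and recognized phone sequences divided by the reference length, and that False Acceptance Rate is the fraction of mispronounced phones the system accepts as correct, as is common in MDD evaluation. The paper's exact scoring protocol is not given here and may differ. The function names and the tone-marked phone labels are hypothetical, chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def false_acceptance_rate(canonical, spoken, recognized):
    """Assumed definition: FA / (FA + TR) over mispronounced phones.

    All three sequences are assumed to be aligned and of equal length.
    A phone counts as mispronounced when `spoken` differs from `canonical`;
    it is a false acceptance when the system still outputs the canonical phone.
    """
    false_accept = 0
    true_reject = 0
    for c, s, r in zip(canonical, spoken, recognized):
        if s != c:  # the learner mispronounced this phone
            if r == c:
                false_accept += 1
            else:
                true_reject += 1
    total = false_accept + true_reject
    return false_accept / total if total else 0.0


# Illustrative phones for "ni3 hao3" (hypothetical labels, not the paper's inventory).
canonical = ["n", "i3", "h", "ao3"]
spoken = ["n", "i2", "h", "ao4"]       # two tone errors made by the learner
recognized = ["n", "i3", "h", "ao4"]   # the system misses the first error

per = phone_error_rate(spoken, recognized)
far = false_acceptance_rate(canonical, spoken, recognized)
```

In this toy case the recognizer gets one of four phones wrong relative to what was actually spoken, and it accepts one of the two mispronunciations.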