A transfer learning based approach for pronunciation scoring
This work addresses accurate pronunciation scoring for language learners. Its contribution is incremental: it builds on existing ASR and transfer learning methods rather than introducing a new modelling technique.
The paper tackles phone-level pronunciation scoring by proposing a transfer learning approach that adapts an ASR model to the task. On the EpaDB database, the resulting system is 20% better than a state-of-the-art goodness of pronunciation (GOP) system under a cost function that prioritizes low rates of unnecessary corrections.
Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system is 20% better than the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.
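For readers unfamiliar with the GOP baseline mentioned above, the following is a minimal sketch of one common formulation: the score for each phone is the average log posterior of the target phone over the frames aligned to it, and a phone is flagged as mispronounced when its score falls below a threshold. This is an illustration under stated assumptions, not the paper's implementation. The function names (gop_scores, flag_mispronounced), the input layout, and the threshold value are all hypothetical, and the posteriors are assumed to come from an ASR acoustic model together with a forced alignment.

```python
import math
from typing import List, Sequence, Tuple

# Assumed input layout (hypothetical, for illustration only):
#   posteriors[t][p] = posterior probability of phone p at frame t
#   alignment        = list of (target_phone_index, start_frame, end_frame),
#                      with end_frame exclusive, e.g. from a forced aligner
Alignment = Sequence[Tuple[int, int, int]]


def gop_scores(posteriors: Sequence[Sequence[float]],
               alignment: Alignment,
               floor: float = 1e-10) -> List[float]:
    """Return one GOP-style score per aligned phone.

    Each score is the mean log posterior of the target phone over the
    frames assigned to it. Posteriors are floored to avoid log(0).
    """
    scores = []
    for phone, start, end in alignment:
        if end <= start:
            raise ValueError("each aligned phone needs at least one frame")
        log_sum = sum(math.log(max(posteriors[t][phone], floor))
                      for t in range(start, end))
        scores.append(log_sum / (end - start))
    return scores


def flag_mispronounced(scores: Sequence[float],
                       threshold: float) -> List[bool]:
    """Flag a phone as mispronounced when its score is below the threshold."""
    return [s < threshold for s in scores]


if __name__ == "__main__":
    # Toy example: 4 frames, 3 phones in the inventory.
    posteriors = [
        [0.9, 0.05, 0.05],
        [0.8, 0.10, 0.10],
        [0.2, 0.70, 0.10],
        [0.1, 0.30, 0.60],
    ]
    # Phone 0 is expected on frames 0-1, phone 2 on frames 2-3.
    alignment = [(0, 0, 2), (2, 2, 4)]
    scores = gop_scores(posteriors, alignment)
    print(scores)
    print(flag_mispronounced(scores, threshold=-1.0))
```

Lowering the threshold makes the system flag fewer phones, which is one simple way to trade missed errors for fewer unnecessary corrections. A single global threshold is used here only to keep the sketch short.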