Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
This work addresses accent mismatch in speech recognition by jointly modeling accents and acoustics, which improves performance in multi-accent scenarios, though the contribution is incremental relative to existing multi-task approaches.
The paper tackles the degradation of automatic speech recognition performance caused by accent mismatch by jointly learning an accent classifier and a multi-task acoustic model, yielding relative word error rate improvements of 5.94% on British English and 9.47% on American English over a multi-task acoustic model baseline.
The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to dealing with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to individual accents. In this paper, we explore an alternative model in which we jointly learn an accent classifier and a multi-task acoustic model. Experiments on the American English Wall Street Journal and British English Cambridge corpora demonstrate that our joint model outperforms the strong multi-task acoustic model baseline. We obtain a 5.94% relative improvement in word error rate on British English and a 9.47% relative improvement on American English. This illustrates that jointly modeling accent information improves acoustic model performance.
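The reported gains are relative, not absolute, reductions in word error rate. As a minimal sketch of that arithmetic, the helper below computes a relative improvement from a baseline WER and a new WER. The function name `relative_wer_improvement` and the input values are hypothetical and chosen only for illustration; the absolute WERs behind the 5.94% and 9.47% figures are not stated here.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Relative improvement = (baseline - new) / baseline, expressed as a percentage.
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) used only to illustrate the arithmetic;
# these are NOT results from the paper.
example = relative_wer_improvement(baseline_wer=10.0, new_wer=9.0)
```

Under this definition, a one-point absolute drop from a 10% baseline WER counts as a 10% relative improvement, so the same absolute gain looks larger when the baseline error is already low.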