AS, CL | Jul 13, 2019

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen

arXiv:1907.06017v1 | 14.539 citations

Originality: Incremental advance

AI Analysis

The paper aims to improve speech recognition accuracy on Chinese datasets by efficiently transferring knowledge from language models into the recognizer. The advance is incremental, since it builds on existing knowledge distillation and fusion techniques.

The paper tackles the problem of integrating external language models into sequence-to-sequence speech recognition without adding components at test time. It proposes a knowledge distillation approach that reaches a character error rate of 9.3%, an 18.42% relative reduction compared with the vanilla sequence-to-sequence baseline.
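The two reported figures imply a baseline error rate that the summary does not state. A minimal sketch of the arithmetic, assuming the 18.42% reduction is measured relative to the baseline character error rate and that 9.3% is a rounded value; the function name is illustrative and not taken from the paper.

```python
def implied_baseline_cer(improved_cer, relative_reduction_pct):
    """Invert improved = baseline * (1 - r) to recover the baseline CER."""
    r = relative_reduction_pct / 100.0
    return improved_cer / (1.0 - r)


# Figures quoted in the summary: 9.3% CER after an 18.42% relative reduction.
baseline = implied_baseline_cer(9.3, 18.42)
print(f"implied baseline CER: {baseline:.1f}%")
```

Under these assumptions the vanilla model's character error rate comes out at roughly 11.4%.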

Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works use linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach for integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, trained on large-scale external text, generates soft labels to guide the training of the sequence-to-sequence model. The language model thus plays the role of the teacher. This approach does not add any external component to the sequence-to-sequence model during testing, and it can also be combined with the shallow fusion technique for decoding. The experiments are conducted on the public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, a relative reduction of 18.42% compared with the vanilla sequence-to-sequence model.
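The abstract does not give the training objective, so the following is only a sketch of how soft labels from a language model teacher are commonly combined with the ground-truth label for a single output step. The interpolation weight `lam`, the softmax `temperature`, the fusion weight `beta`, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      lam=0.5, temperature=1.0):
    """Interpolate hard-label and soft-label cross-entropy for one step.

    student_logits: scores from the sequence-to-sequence model.
    teacher_logits: scores from the language model (the teacher).
    lam: assumed weight on the soft-label term (0 = hard labels only).
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits, temperature)
    # Cross-entropy against the ground-truth character.
    hard = -math.log(student_probs[target_index])
    # Cross-entropy against the teacher's soft labels.
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return (1.0 - lam) * hard + lam * soft


def shallow_fusion_score(s2s_log_prob, lm_log_prob, beta=0.3):
    """Decoding-time score: log-linear interpolation with an external LM."""
    return s2s_log_prob + beta * lm_log_prob


student = [2.0, 0.5, -1.0]   # toy scores over a 3-character vocabulary
teacher = [0.1, 3.0, 0.2]
print(distillation_loss(student, teacher, target_index=0, lam=0.5))
print(shallow_fusion_score(-1.0, -2.0, beta=0.5))
```

The teacher is only consulted inside `distillation_loss`, which is why training this way adds nothing to the model at test time. `shallow_fusion_score` is the optional decoding-time combination, and it does require the language model during decoding.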

