Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
This work addresses the challenge of improving speech recognition accuracy with contextual cues. The contribution is incremental: it builds on existing deep bias methods by adding explicit supervision for the bias task.
The paper tackles the problem of incorporating contextual information into end-to-end speech recognition models by introducing a contextual phrase prediction network trained with explicit supervision. On the LibriSpeech corpus, it achieves a 12.1% relative word error rate (WER) improvement over the baseline and a 40.5% relative WER reduction on context phrases.
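Because both headline results are relative rather than absolute, it may help to spell out the arithmetic. The following is a minimal sketch of the standard relative-reduction formula; the function name and the WER values in the usage note are hypothetical illustrations, not figures reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Both arguments are absolute WERs on the same scale (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Relative reduction = (baseline - new) / baseline.
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, with a hypothetical baseline WER of 10.0% and a new WER of 8.79%, `relative_wer_reduction(10.0, 8.79)` evaluates to approximately 0.121, which would be reported as a 12.1% relative improvement even though the absolute drop is only 1.21 points.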
Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts the context phrases that occur in an utterance using contextual embeddings and computes a bias loss that assists the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on context phrases decreases by a relative 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation that occurs when using a larger biasing list.
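The idea of explicit supervision for the bias task can be illustrated with a small sketch. The abstract does not specify how the prediction target or the loss combination is defined, so everything below is an assumption made for illustration: the target is taken to be the transcript words covered by some phrase in the biasing list, and the bias loss is added to the recognition loss with a hypothetical weight `bias_weight`.

```python
from typing import List, Sequence


def bias_target(transcript: Sequence[str],
                phrases: Sequence[Sequence[str]]) -> List[str]:
    """Keep only the transcript words covered by a matching context phrase.

    Assumption: this is one plausible supervision target for a context
    phrase prediction network, not necessarily the one used in the paper.
    """
    keep = [False] * len(transcript)
    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue
        # Mark every position where the phrase occurs contiguously.
        for start in range(len(transcript) - n + 1):
            if list(transcript[start:start + n]) == list(phrase):
                for i in range(start, start + n):
                    keep[i] = True
    return [word for word, kept in zip(transcript, keep) if kept]


def combined_loss(asr_loss: float, bias_loss: float,
                  bias_weight: float = 0.3) -> float:
    """Add a weighted bias loss to the recognition loss.

    Assumption: a simple weighted sum; the default weight is hypothetical.
    """
    return asr_loss + bias_weight * bias_loss
```

As a usage example, `bias_target("call john smith at noon".split(), [["john", "smith"], ["paris"]])` keeps only the words of the phrase that actually appears in the utterance, so the prediction network would be trained to emit those words and nothing for the absent phrase.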