Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
This work addresses the challenge of conversational speech recognition for applications such as voice assistants, though it appears incremental because it builds on existing embeddings and frameworks.
The paper tackles the recognition of long conversations by incorporating conversational-context information that spans sentences, and reports a significant improvement in word error rate on the Switchboard corpus over standard end-to-end models.
We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context, word, and speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans sentences and is consequently better at recognizing long conversations. Specifically, we propose to use text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate through a better conversational-context representation. We evaluate the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.
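To make the idea of gating concrete, the following is a minimal sketch of how conversational-context, word, and speech embeddings might be fused through an element-wise gate. It is an illustration under stated assumptions, not the paper's actual formulation: the gate form (a sigmoid gate applied to a tanh projection of the concatenated embeddings), the function names `affine`, `random_params`, and `gated_fusion`, the dimensions, and the random stand-in vectors are all hypothetical. In a real system the three inputs would come from a context encoder, a pretrained text embedding such as fastText or BERT, and the acoustic encoder.

```python
import math
import random


def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def random_params(rng, d_out, d_in):
    """Small random weights and a zero bias (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)]
               for _ in range(d_out)]
    return weights, [0.0] * d_out


def gated_fusion(context_emb, word_emb, speech_emb, gate_params, proj_params):
    """Fuse three embeddings with an element-wise gate.

    Assumed form (not taken from the paper):
        x     = [context; word; speech]
        gate  = sigmoid(W_g x + b_g)
        cand  = tanh(W_p x + b_p)
        fused = gate * cand
    """
    x = context_emb + word_emb + speech_emb  # list concatenation
    gate = [1.0 / (1.0 + math.exp(-z)) for z in affine(*gate_params, x)]
    cand = [math.tanh(z) for z in affine(*proj_params, x)]
    return [g * c for g, c in zip(gate, cand)]


# Stand-in embeddings with arbitrary, hypothetical dimensions.
rng = random.Random(0)
d_ctx, d_word, d_speech, d_out = 8, 4, 6, 5
d_in = d_ctx + d_word + d_speech

context_emb = [rng.uniform(-1, 1) for _ in range(d_ctx)]
word_emb = [rng.uniform(-1, 1) for _ in range(d_word)]
speech_emb = [rng.uniform(-1, 1) for _ in range(d_speech)]

gate_params = random_params(rng, d_out, d_in)
proj_params = random_params(rng, d_out, d_in)

fused = gated_fusion(context_emb, word_emb, speech_emb,
                     gate_params, proj_params)
print(len(fused))
```

The gate lets the model decide, per output dimension, how much of the combined signal to pass on, which is one plausible way to keep an uninformative context vector from degrading the acoustic evidence. In training, the gate and projection parameters would be learned jointly with the rest of the recognizer.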