Multichannel End-to-end Speech Recognition
This work addresses noise robustness in speech recognition for applications such as voice assistants, but its contribution is incremental: it extends an existing end-to-end framework.
The paper improves speech recognition in noisy environments by integrating microphone array signal processing into an end-to-end neural network. The resulting system outperformed an attention-based baseline fed by a conventional adaptive beamformer on the CHiME-4 and AMI benchmarks.
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.
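To make the beamforming component concrete, the sketch below shows a generic filter-and-sum beamformer in the STFT domain: each channel is multiplied by a complex filter per frequency bin, and the channels are summed into one enhanced signal. This is an illustration under stated assumptions, not the paper's implementation. The function names `filter_and_sum` and `delay_and_sum_filters` are hypothetical, and the fixed delay-and-sum filters stand in for the filters that the paper's system would optimize jointly with the recognizer.

```python
import cmath
import math


def filter_and_sum(stft, filters):
    """Apply per-channel complex filters and sum over channels.

    stft[c][t][f] : complex STFT coefficient of channel c, frame t, bin f
    filters[c][f] : complex filter of channel c, bin f (time-invariant here)
    Returns out[t][f] = sum_c conj(filters[c][f]) * stft[c][t][f].
    """
    num_channels = len(stft)
    num_frames = len(stft[0])
    num_bins = len(stft[0][0])
    return [
        [
            sum(filters[c][f].conjugate() * stft[c][t][f]
                for c in range(num_channels))
            for f in range(num_bins)
        ]
        for t in range(num_frames)
    ]


def delay_and_sum_filters(delays_s, freqs_hz):
    """Assumed stand-in filters: undo each channel's delay, then average.

    A jointly trained system would estimate the filters from data instead.
    """
    num_channels = len(delays_s)
    return [
        [cmath.exp(-2j * math.pi * f * d) / num_channels for f in freqs_hz]
        for d in delays_s
    ]


# Toy example: one source observed by three microphones with known delays.
freqs_hz = [0.0, 1000.0, 2000.0]
delays_s = [0.0, 1e-4, 2.5e-4]
source = [
    [1 + 1j, 0.5 - 2j, -1j],
    [2 + 0j, -1 + 1j, 0.25 + 0.5j],
]
observed = [
    [
        [s * cmath.exp(-2j * math.pi * f * d) for s, f in zip(frame, freqs_hz)]
        for frame in source
    ]
    for d in delays_s
]
enhanced = filter_and_sum(observed, delay_and_sum_filters(delays_s, freqs_hz))
```

In this noise-free toy case the delays are known exactly, so `enhanced` reproduces `source` up to floating-point error. In an end-to-end system, every operation in `filter_and_sum` is differentiable, which is what allows the filters to be trained against the recognition objective.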