SD · CL · Mar 14, 2017

Multichannel End-to-end Speech Recognition

arXiv:1703.04783v1 · 94 citations
Originality: Incremental advance
AI Analysis

This work addresses noise robustness in speech recognition for applications like voice assistants, but it is incremental as it extends an existing end-to-end framework.

The paper improves speech recognition in noisy environments by integrating microphone array signal processing into an end-to-end neural network. On the CHiME-4 and AMI benchmarks, the resulting multichannel system outperformed an attention-based baseline that takes its input from a conventional adaptive beamformer.

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.
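To make the beamforming step concrete, here is a minimal sketch of a filter-and-sum beamformer applied to a single STFT frequency bin. It is not the paper's method: the abstract does not say how the beamforming weights are obtained, so this sketch substitutes fixed, classical delay-and-sum weights. In a jointly trained system the weights would presumably be produced by the network and optimized through the recognition objective. The microphone delays, the frequency, and all function names are hypothetical, chosen only for illustration.

```python
import cmath
import math


def steering_vector(delays_s, freq_hz):
    """Per-channel phase shifts for a plane wave arriving with the given delays."""
    return [cmath.exp(-2j * math.pi * freq_hz * tau) for tau in delays_s]


def delay_and_sum_weights(delays_s, freq_hz):
    """Fixed (non-learned) weights that align the target and average the channels."""
    d = steering_vector(delays_s, freq_hz)
    return [dc / len(d) for dc in d]


def filter_and_sum(weights, channels):
    """Combine one STFT bin across channels: y = sum_c conj(w_c) * x_c."""
    return sum(w.conjugate() * x for w, x in zip(weights, channels))


# Hypothetical 4-microphone array, one frequency bin at 1 kHz.
freq = 1000.0
target_delays = [0.0, 1.0e-4, 2.0e-4, 3.0e-4]       # assumed target arrival delays (s)
noise_delays = [0.0, -2.5e-4, -5.0e-4, -7.5e-4]     # assumed interferer arrival delays (s)

source = 1.0 + 0.5j  # complex STFT coefficient emitted by either source

# What each microphone observes when only the target, or only the interferer, is active.
target_obs = [source * d for d in steering_vector(target_delays, freq)]
noise_obs = [source * d for d in steering_vector(noise_delays, freq)]

w = delay_and_sum_weights(target_delays, freq)
target_out = filter_and_sum(w, target_obs)  # target direction passes with unit gain
noise_out = filter_and_sum(w, noise_obs)    # off-target direction is attenuated

print(abs(target_out), abs(noise_out))
```

Because the combination is a plain weighted sum over channels, it is differentiable with respect to the weights, which is what makes it possible in principle to train such a component together with the recognizer.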

