AS LG SD May 12, 2021

Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition

Bhargav Pulugundla, Yang Gao, Brian King, Gokce Keskin, Harish Mallidi, Minhua Wu, Jasha Droppo, Roland Maas

arXiv:2105.05920v23.32 citations

Originality Incremental advance

AI Analysis

This work targets speech recognition in noisy environments, for applications such as voice assistants. The advance is incremental because it builds on existing attention-based beamformers.

The paper tackles multi-channel speech recognition by proposing a 2D Conv-Attention module that combines convolutional neural networks with attention for beamforming, yielding a 3.8% relative improvement in word error rate (WER) over a baseline neural beamformer.
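
As a reminder of what a relative improvement in word error rate means, the short sketch below computes it from two absolute WER values. The paper reports only the relative figure, so the baseline and proposed WERs used here are hypothetical numbers chosen for illustration, and the function name is likewise an assumption.

```python
def relative_wer_improvement(baseline_wer: float, proposed_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    return (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical absolute WERs (percent); the paper does not report these values.
baseline_wer = 10.0
proposed_wer = 9.62

rel = relative_wer_improvement(baseline_wer, proposed_wer)
print(f"{rel:.1%}")  # prints 3.8%
```

With these made-up numbers, an absolute drop of 0.38 points corresponds to a 3.8% relative improvement, which shows how small the absolute change can be.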

Attention-based beamformers have recently been shown to be effective for multi-channel speech recognition. However, they are less capable at capturing local information. In this work, we propose a 2D Conv-Attention module which combines convolution neural networks with attention for beamforming. We apply self- and cross-attention to explicitly model the correlations within and between the input channels. The end-to-end 2D Conv-Attention model is compared with a multi-head self-attention and superdirective-based neural beamformers. We train and evaluate on an in-house multi-channel dataset. The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
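
The difference between self- and cross-attention over input channels can be sketched with plain scaled dot-product attention. This is only an illustration under stated assumptions, not the authors' 2D Conv-Attention module: it omits the convolutional layers and learned projections, and the two-channel setup, array shapes, and function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query: np.ndarray, key: np.ndarray, value: np.ndarray):
    """Scaled dot-product attention over (frames, features) matrices."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (frames_q, frames_k)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ value, weights


# Hypothetical features for two microphone channels: 6 frames, 8 features each.
rng = np.random.default_rng(0)
frames, feat_dim = 6, 8
ch0 = rng.standard_normal((frames, feat_dim))
ch1 = rng.standard_normal((frames, feat_dim))

# Self-attention: channel 0 attends to itself (correlations within a channel).
self_out, self_w = attention(ch0, ch0, ch0)

# Cross-attention: channel 0 queries channel 1 (correlations between channels).
cross_out, cross_w = attention(ch0, ch1, ch1)

print(self_out.shape, cross_out.shape)
```

In self-attention the queries, keys, and values all come from one channel, whereas in cross-attention the queries come from one channel and the keys and values from another.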
