End-to-End Multi-Channel Transformer for Speech Recognition
This work aims to improve speech recognition accuracy in far-field environments for users of multi-microphone devices, offering an incremental improvement over existing methods.
This paper proposes a multi-channel transformer for speech recognition that integrates spectral and spatial information from multiple microphones using attention layers. The method outperforms baseline single-channel transformers and neural beamformers cascaded with transformers on a far-field in-house dataset.
Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network consists of three main parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode contextual relationships across time within each channel and between channels, respectively. The channel-attended outputs from the CSA and CCA layers are then fed into the EDA layers to help decode the next token given the preceding ones. Experiments show that, on a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as super-directive and neural beamformers cascaded with transformers.
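To make the roles of the three attention types more concrete, the following is a minimal NumPy sketch of how such layers could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to concatenate the frames of all other channels as keys and values in the cross-channel step, and the averaging over channels in the encoder-decoder step are hypothetical. Learned projections, multiple heads, residual connections, and layer normalization are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    # q: (..., T_q, d), k and v: (..., T_k, d) -> output: (..., T_q, d)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def channel_wise_self_attention(x):
    # x: (C, T, d). Each channel attends only to its own frames across time.
    return attention(x, x, x)


def cross_channel_attention(x):
    # x: (C, T, d), with C >= 2. Each channel queries the frames of the
    # other channels. Assumption: keys/values are the concatenated frames
    # of all other channels; the paper may combine them differently.
    num_channels = x.shape[0]
    out = np.empty_like(x)
    for c in range(num_channels):
        others = np.concatenate(
            [x[j] for j in range(num_channels) if j != c], axis=0
        )  # ((C - 1) * T, d)
        out[c] = attention(x[c], others, others)
    return out


def encoder_decoder_attention(dec, enc):
    # dec: (U, d) decoder states, enc: (C, T, d) encoder outputs.
    # Assumption: attend to each channel separately, then average.
    per_channel = attention(dec[None], enc, enc)  # (C, U, d)
    return per_channel.mean(axis=0)  # (U, d)


# Usage with random features: 3 microphones, 5 frames, 8-dim features,
# and 4 decoder positions.
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 5, 8))
encoded = cross_channel_attention(channel_wise_self_attention(features))
decoder_states = rng.standard_normal((4, 8))
context = encoder_decoder_attention(decoder_states, encoded)
print(encoded.shape, context.shape)
```

In this sketch the encoder output keeps one sequence per channel, so the decoder can still weight frames from each microphone separately before the per-channel contexts are merged.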