AS, LG, SD | Aug 30, 2021

Multi-Channel Transformer Transducer for Speech Recognition

arXiv:2108.12953v1 | 24 citations
AI Analysis

This work addresses the need for efficient, low-latency multi-channel speech recognition for on-device systems, offering a novel method that improves accuracy and speed over prior approaches.

The paper tackles the high computational complexity of multi-channel speech recognition models by proposing the Multi-Channel Transformer Transducer (MCTT), which achieves up to 11.62% relative WER improvement over the multi-channel transformer and is 15.8 times faster at inference.
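
The relative WER figures quoted here are not defined on this page. The sketch below assumes the usual definition, the reduction in WER expressed as a percentage of the baseline WER. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER improvement (WERR) in percent.

    Assumed definition: 100 * (baseline - new) / baseline.
    Both arguments are word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: a baseline at 10.0% WER
    # and a new model at 9.0% WER give a 10.0% relative improvement.
    print(relative_wer_improvement(10.0, 9.0))
```

Under this definition, a WERR of 11.62% means the new model removes 11.62% of the baseline's word errors, not that its absolute WER dropped by 11.62 points.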

Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency so that it is suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, our MCTT outperforms stagewise multi-channel models with transformer-transducer by up to 6.01% relative WER improvement (WERR). In addition, MCTT outperforms the multi-channel transformer by up to 11.62% WERR and is 15.8 times faster in terms of inference speed. We further show that we can reduce the computational cost of MCTT by constraining the future and previous context in attention computations.
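
To illustrate the last point, here is a minimal sketch of how limiting the previous and future context shrinks the attention computation. It is a generic banded attention mask, not the paper's implementation. The function names and the context sizes are hypothetical, since the abstract does not state which values MCTT uses.

```python
def limited_context_mask(num_frames, left_context, right_context):
    """Build a boolean attention mask with bounded context.

    mask[i][j] is True when query frame i may attend to key frame j,
    i.e. when i - left_context <= j <= i + right_context.
    """
    mask = []
    for i in range(num_frames):
        first = max(0, i - left_context)
        last = min(num_frames - 1, i + right_context)
        mask.append([first <= j <= last for j in range(num_frames)])
    return mask


def allowed_pairs(mask):
    """Count the query-key pairs that the mask leaves enabled."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Hypothetical sizes: 5 frames, 2 frames of history, 1 frame of lookahead.
    mask = limited_context_mask(num_frames=5, left_context=2, right_context=1)
    for row in mask:
        print("".join("x" if ok else "." for ok in row))
    print(allowed_pairs(mask), "of", 5 * 5, "query-key pairs remain")
```

With full attention, the number of query-key pairs grows quadratically with the number of frames. With a fixed window it grows linearly, and a small right context also bounds how many future frames the model must wait for, which matters for streaming decoding.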

