AS · CL · LG · SD · May 7, 2020

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

arXiv:2005.03191v3 · 302 citations
AI Analysis

This work addresses the performance gap between CNN-based speech recognizers and other state-of-the-art methods, which matters for applications that need efficient and accurate models. It is incremental in that it builds on existing CNN and transducer methods.

The paper introduces ContextNet, a CNN-RNN-transducer architecture for automatic speech recognition whose fully convolutional encoder incorporates global context via squeeze-and-excitation modules. On LibriSpeech it achieves a word error rate of 2.1%/4.6% on the clean/noisy test sets without an external language model, which is competitive with or better than the previous best published systems.

Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
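
The squeeze-and-excitation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `squeeze_excite_1d`, the ReLU bottleneck, and the weight shapes are assumptions chosen for illustration. It shows only the general mechanism: average the features over the whole time axis to get one global context vector, turn that vector into a per-channel gate between 0 and 1, and rescale every frame by the gate.

```python
import math

def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Illustrative squeeze-and-excitation over a 1D feature sequence.

    x  : list of T frames, each a list of C floats (e.g. a conv layer's output)
    w1 : C x H weights, b1 : H biases   (bottleneck layer; H is hypothetical)
    w2 : H x C weights, b2 : C biases   (projection back to C channels)
    """
    T, C, H = len(x), len(x[0]), len(b1)

    # Squeeze: average each channel over all T frames (the global context).
    context = [sum(frame[c] for frame in x) / T for c in range(C)]

    # Excite: bottleneck with ReLU (an assumed activation), then a sigmoid gate.
    hidden = [
        max(0.0, sum(context[c] * w1[c][h] for c in range(C)) + b1[h])
        for h in range(H)
    ]
    gate = [
        1.0 / (1.0 + math.exp(-(sum(hidden[h] * w2[h][c] for h in range(H)) + b2[c])))
        for c in range(C)
    ]

    # Scale: every frame is multiplied channel-wise by the same global gate.
    return [[frame[c] * gate[c] for c in range(C)] for frame in x]


# Tiny example: 2 frames, 2 channels, bottleneck of size 1, all-zero weights.
# With zero weights the sigmoid gate is exactly 0.5 for every channel.
x = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[0.0], [0.0]]
b1 = [0.0]
w2 = [[0.0, 0.0]]
b2 = [0.0, 0.0]
y = squeeze_excite_1d(x, w1, b1, w2, b2)
```

Because the gate is computed from an average over the entire sequence, each output frame depends on all input frames, which is how a layer with a small convolution kernel can still see global context.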

Code Implementations (6 repos)