AS, CL, LG, SD | Apr 7, 2021

Pushing the Limits of Non-Autoregressive Speech Recognition

arXiv:2104.03416v4 | 32 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for faster, more efficient speech recognition systems, but it is incremental: it builds on existing methods to push performance limits.

The paper improves non-autoregressive automatic speech recognition by combining end-to-end techniques (CTC, Conformer architectures, SpecAugment, and wav2vec2 pre-training). It reports state-of-the-art non-autoregressive word error rates of 1.8%/3.6% on LibriSpeech test/test-other and 5.1%/9.8% on Switchboard, without a language model.

We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
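To make the two technical terms in the abstract concrete, here is a minimal Python sketch of how a CTC output can be decoded in one pass and how WER is computed. It is an illustration under stated assumptions, not the authors' implementation: greedy (best-path) decoding, the blank index `0`, and the function names `ctc_greedy_decode` and `word_error_rate` are all choices made for this example. The decoder is non-autoregressive in the sense that each frame's label is chosen independently of previously emitted labels.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Each frame is scored independently; no previously emitted label is used.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded: List[int] = []
    previous = None
    for label in best:
        # Keep a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            row.append(min(
                prev_row[j] + 1,                          # deletion
                row[j - 1] + 1,                           # insertion
                prev_row[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev_row = row
    return prev_row[-1] / len(ref)


# Toy per-frame scores over a 3-symbol vocabulary (0 is blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
tokens = ctc_greedy_decode(frames)
wer = word_error_rate("the cat sat down", "the cat sit")
```

In the toy input, the blank frame between the two runs of label `1` is what lets CTC emit the same label twice in a row. A paired figure such as 1.8%/3.6% is this WER ratio, expressed as a percentage, on two test sets.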

