Pushing the Limits of Non-Autoregressive Speech Recognition
This work addresses the need for faster, more efficient speech recognition systems. Its contribution is incremental, however, since it builds on existing methods to push performance limits.
The paper improves non-autoregressive automatic speech recognition by combining end-to-end techniques (CTC, Conformer architectures, SpecAugment, and wav2vec2 pre-training). Without a language model, it reaches state-of-the-art non-autoregressive word error rates, including 1.8%/3.6% on LibriSpeech and 5.1%/9.8% on Switchboard.
We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results on multiple datasets: LibriSpeech, Fisher+Switchboard, and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on the LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% WER on Wall Street Journal, all without a language model.
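
To illustrate why a CTC model can decode without autoregression, the following is a minimal sketch of greedy (best-path) CTC decoding. It is not the authors' implementation: the function name `ctc_greedy_decode`, the blank index of 0, and the plain-list input format are assumptions made for illustration. Each frame's label is chosen independently of the others, so no output token depends on a previously emitted one.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Collapse per-frame argmax labels into an output label sequence."""
    # Pick the highest-scoring label for each frame independently.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # CTC collapse rule: merge consecutive repeats first, then drop blanks.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != blank]
```

For example, a best path of `1, 1, blank, 1, 2, 2, blank` collapses to the label sequence `1, 1, 2`: the blank between the two runs of `1` keeps them distinct, while the repeated `2` frames merge into one label.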