AS CL SD
Oct 26, 2020

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

arXiv:2010.13270v2
68 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for fast and accurate ASR systems for real-world deployment, offering an incremental improvement over prior non-autoregressive methods.

The paper narrows the accuracy gap between Mask-CTC, a non-autoregressive ASR system, and autoregressive models. It replaces the encoder with a Conformer and introduces training and decoding methods that let the model delete or insert tokens during inference. The improved Mask-CTC outperforms a standard CTC model (15.5% to 9.1% WER on WSJ) while keeping inference fast (<0.1 RTF on CPU).

For real-world deployment of automatic speech recognition (ASR), a system should support fast inference while keeping computational resource requirements low. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network by employing the recently proposed Conformer architecture. Next, we propose new training and decoding methods that introduce an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% $\rightarrow$ 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models with no degradation of inference speed ($<$ 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
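The abstract reports results with two standard metrics, word error rate (WER) and real-time factor (RTF). As a reading aid, here is a minimal sketch of how these quantities are conventionally computed. It is not the authors' evaluation code. The function names `word_error_rate` and `real_time_factor` are hypothetical, and the sketch assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace with no normalization;
    real evaluation toolkits usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

Under these definitions, an RTF below 0.1 means that decoding takes less than one tenth of the audio duration, for example under 1 second for a 10-second utterance. The WER change from 15.5% to 9.1% quoted in the abstract is a relative reduction of roughly 41%.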
