ASSD · May 18, 2020

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

arXiv:2005.08700v2 · 157 citations
AI Analysis

This work addresses the need for faster, real-time end-to-end ASR systems, though it builds incrementally on existing CTC and non-autoregressive methods.

The paper tackles the problem of slow inference in autoregressive speech recognition models by proposing Mask CTC, a non-autoregressive framework that refines CTC outputs using mask prediction. Relative to a standard CTC model, it reduces word error rate (WER) from 17.9% to 12.1% on WSJ, and it runs at a real-time factor (RTF) of 0.07 on CPUs.
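To put the two headline numbers in context, the short sketch below computes a relative error reduction and a real-time factor. The WER values are the ones quoted above; the helper names and the timing figures (a 10 s utterance decoded in 0.7 s) are hypothetical and chosen only to illustrate what an RTF of 0.07 means.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline


# WER figures quoted in the summary (WSJ, standard CTC vs. Mask CTC).
wer_drop = relative_reduction(17.9, 12.1)
print(f"relative WER reduction: {wer_drop:.1%}")

# Hypothetical timing: 0.7 s of compute for 10 s of audio.
rtf = real_time_factor(0.7, 10.0)
print(f"RTF: {rtf:.2f}")
```

An RTF below 1 means decoding runs faster than the audio plays, so 0.07 corresponds to roughly 7% of the audio duration spent on inference.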

We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. Non-autoregressive models, on the other hand, can generate tokens simultaneously within a constant number of iterations, which significantly reduces inference time and makes end-to-end ASR models better suited to real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioned on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., WER reduced from 17.9% to 12.1% on WSJ) and approaches the autoregressive model, while requiring much less inference time on CPUs (0.07 RTF in a Python implementation). All of our code will be publicly available.
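The inference procedure described in the abstract (greedy CTC initialization, masking of low-confidence tokens, then filling the masks conditioned on the confident tokens) can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' implementation. The names `greedy_ctc`, `mask_ctc_decode`, and `toy_decoder`, the threshold of 0.9, the two-iteration fill schedule, and the use of the maximum frame probability as a token's confidence are all assumptions of this sketch. A real decoder would also attend to the encoder output; the toy stand-in here just returns a fixed distribution.

```python
import numpy as np

BLANK = 0   # assumed index of the CTC blank symbol
MASK = -1   # assumed placeholder id for a masked token


def greedy_ctc(posteriors):
    """Collapse the frame-wise argmax into tokens with one confidence each.

    posteriors: (frames, vocab) array of per-frame CTC probabilities.
    A token's confidence is the maximum probability over the frames merged
    into it (an assumption of this sketch).
    """
    best = posteriors.argmax(axis=1)
    prob = posteriors.max(axis=1)
    tokens, confs = [], []
    prev = BLANK
    for tok, p in zip(best, prob):
        if tok != BLANK:
            if tok == prev:
                confs[-1] = max(confs[-1], float(p))  # repeated frame, same token
            else:
                tokens.append(int(tok))
                confs.append(float(p))
        prev = tok
    return np.array(tokens, dtype=int), np.array(confs)


def mask_ctc_decode(posteriors, decoder, threshold=0.9, iterations=2):
    """Mask low-confidence CTC tokens, then fill them in a fixed number of passes."""
    tokens, confs = greedy_ctc(posteriors)
    tokens = np.where(confs < threshold, MASK, tokens)
    for step in range(iterations):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = decoder(tokens)                 # (length, vocab) distributions
        cand = probs[masked].argmax(axis=1)
        score = probs[masked].max(axis=1)
        # Fill the most confident masks first; the last pass fills the rest.
        n_fill = int(np.ceil(masked.size / (iterations - step)))
        order = np.argsort(-score)[:n_fill]
        tokens[masked[order]] = cand[order]
    return tokens


def toy_decoder(tokens):
    """Hypothetical stand-in for the mask-predict decoder: always favors token 3."""
    probs = np.full((len(tokens), 4), 0.1)
    probs[:, 3] = 0.7
    return probs


# Four frames over a 4-symbol vocabulary (index 0 is blank).
posteriors = np.array([
    [0.02, 0.95, 0.02, 0.01],   # confident token 1
    [0.90, 0.05, 0.03, 0.02],   # blank
    [0.10, 0.20, 0.40, 0.30],   # uncertain token 2, which will be masked
    [0.01, 0.01, 0.01, 0.97],   # confident token 3
])
result = mask_ctc_decode(posteriors, toy_decoder)
print(result.tolist())
```

The number of decoder passes is fixed by `iterations` rather than by the output length, which is the property that separates this style of decoding from autoregressive generation.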

