AS, LG, SD, ML · Nov 10, 2019

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

arXiv:1911.04908v2 · 38 citations
Originality: Incremental advance
AI Analysis

This paper addresses the computational inefficiency of production speech recognition systems. It offers a faster alternative while maintaining accuracy, though the advance is incremental because it builds on existing non-autoregressive methods.

The paper tackles the high inference cost of autoregressive transformers in speech recognition by proposing non-autoregressive transformer structures (A-CMLM and A-FMLM) that predict masked tokens iteratively, achieving performance matching state-of-the-art autoregressive transformers with a 7x speedup on the Aishell benchmark.

Recently, very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, inference computation cost is still a serious concern when putting them into production. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both the unmasked context and the input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts the missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right decoding. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show that it is possible to train such a non-autoregressive network for ASR. In particular, on Aishell the proposed method outperforms the Kaldi ASR system and matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup. Pretrained models and code will be made available after publication.
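The masking and iterative decoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MASK`, `random_mask`, `easy_first_decode`, and `make_toy_predictor` are hypothetical, the even per-iteration schedule is an assumption, and the toy predictor stands in for a real decoder conditioned on speech. It only shows the control flow: corrupt decoder inputs with mask tokens for training, then at inference start from all masks and commit the most confident predictions first.

```python
import random

MASK = -1  # hypothetical id standing in for the special mask token


def random_mask(tokens, mask_ratio, rng):
    """Training-side corruption: randomly replace decoder inputs with MASK."""
    return [MASK if rng.random() < mask_ratio else t for t in tokens]


def easy_first_decode(predict_fn, length, num_iters):
    """Start from all MASK tokens and commit the most confident positions first.

    predict_fn(tokens) must return one (token, confidence) pair per position.
    Assumption: positions are committed in roughly equal batches per pass.
    """
    tokens = [MASK] * length
    per_iter = max(1, -(-length // num_iters))  # ceil(length / num_iters)
    while MASK in tokens:
        preds = predict_fn(tokens)
        # Only still-masked positions compete; committed tokens are kept.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_iter]:
            tokens[i] = preds[i][0]
    return tokens


def make_toy_predictor(target):
    """Hypothetical stand-in for the decoder: it already knows the answer and
    is more confident next to tokens that have been committed."""
    def predict_fn(tokens):
        preds = []
        for i, tok in enumerate(target):
            neighbours = tokens[max(0, i - 1):i] + tokens[i + 1:i + 2]
            context = sum(t != MASK for t in neighbours)
            preds.append((tok, 0.5 + 0.2 * context))
        return preds
    return predict_fn


target = [7, 3, 9, 3, 5]
decoded = easy_first_decode(make_toy_predictor(target), len(target), num_iters=3)
masked_input = random_mask(target, 0.5, random.Random(0))
```

In this sketch, sorting the masked positions by index instead of by confidence would yield left-to-right decoding, which is one way to see how a single masked-prediction framework can host several decoding strategies.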

