AS, CL · Apr 4, 2021

TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition

arXiv:2104.01522v1 · 24 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency and convergence issues in speech recognition models for applications that require fast inference; it is an incremental advance that builds on existing NAR approaches.

The paper tackles the performance gap and training difficulty of non-autoregressive (NAR) models in speech recognition by proposing TSNAT, a two-step NAR transformer that learns from a parameter-sharing autoregressive model and uses a two-stage inference method, achieving performance competitive with autoregressive models on the AISHELL-1 dataset.

Autoregressive (AR) models, such as attention-based encoder-decoder models and the RNN-Transducer, have achieved great success in speech recognition. They predict each output token conditioned on the previously generated tokens and the encoded acoustic states, so decoding proceeds one token at a time and cannot be fully parallelized, which makes inference inefficient on GPUs. Non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict the entire output sequence in as few as one step. However, NAR models still face two major problems. First, there is still a large performance gap between NAR models and advanced AR models. Second, most NAR models are difficult to train and slow to converge. To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model. Furthermore, we introduce a two-stage method into the inference process, which greatly improves model performance. All experiments are conducted on AISHELL-1, a public Chinese Mandarin dataset. The results show that TSNAT achieves performance competitive with the AR model and outperforms many more complicated NAR models.
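The efficiency argument at the start of the abstract can be illustrated with a toy sketch. This is not TSNAT's method: `toy_next_token`, `toy_parallel_tokens`, the `EOS` id, and the assumption that the NAR output length is known in advance are all hypothetical stand-ins for a trained decoder and a length predictor. The sketch only counts sequential decoder calls: greedy AR decoding needs one call per output token plus one to emit the end-of-sequence token, while NAR decoding fills every position in a single call.

```python
from typing import List, Sequence, Tuple

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(encoded: Sequence[int], prefix: Sequence[int]) -> int:
    """Hypothetical AR decoder: picks the next token given the tokens emitted so far."""
    position = len(prefix)  # which position comes next depends on the emitted prefix
    return encoded[position] if position < len(encoded) else EOS


def toy_parallel_tokens(encoded: Sequence[int], length: int) -> List[int]:
    """Hypothetical NAR decoder: predicts every position at once, independently."""
    return [encoded[i] if i < len(encoded) else EOS for i in range(length)]


def ar_decode(encoded: Sequence[int], max_len: int) -> Tuple[List[int], int]:
    """Greedy autoregressive decoding; returns (tokens, number of sequential decoder calls)."""
    tokens: List[int] = []
    calls = 0
    while len(tokens) < max_len:
        calls += 1
        nxt = toy_next_token(encoded, tokens)  # must wait for all previous tokens
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens, calls


def nar_decode(encoded: Sequence[int], length: int) -> Tuple[List[int], int]:
    """Non-autoregressive decoding; `length` is assumed known (hypothetical length predictor)."""
    tokens = toy_parallel_tokens(encoded, length)  # one call covers all positions
    return tokens, 1


if __name__ == "__main__":
    encoded = [5, 3, 8, 2]  # toy stand-in for encoded acoustic states
    print(ar_decode(encoded, max_len=10))  # ([5, 3, 8, 2], 5)
    print(nar_decode(encoded, length=4))   # ([5, 3, 8, 2], 1)
```

In a real model the single NAR call is one batched forward pass over all output positions, which is where the GPU speedup comes from. The trade-off is that positions are predicted independently of one another, which is commonly cited as one reason for the accuracy gap between NAR and AR models that the abstract describes.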

Code Implementations: 1 repo