AS · CL · SD · Oct 11, 2021

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

arXiv:2110.05249v1 · 51 citations
Originality: Synthesis-oriented
AI Analysis

This incremental work addresses real-time speech processing challenges by analyzing NAR methods for researchers and practitioners.

The study compares non-autoregressive (NAR) models for speech-to-text generation and finds that they speed up inference at some cost in accuracy. It also finds that the techniques can be combined for further improvement in tasks such as automatic speech recognition and speech translation.

Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Because they show great potential for real-time applications, an increasing number of NAR models have been explored in different fields to narrow the performance gap with AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in a state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.
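
To make the speed side of the trade-off concrete, here is a minimal toy sketch in Python. It is not the paper's code and does not use ESPnet; the names `TARGET`, `toy_step`, `toy_parallel`, `ar_decode`, and `nar_decode` are hypothetical, and the "models" are simple lookups standing in for a real decoder. The sketch only counts how many sequential model calls each decoding style needs.

```python
from typing import Callable, List, Tuple

# Hypothetical reference output that both toy "models" reproduce exactly.
TARGET: List[str] = ["the", "cat", "sat", "<eos>"]


def toy_step(prefix: List[str]) -> str:
    """Toy AR model (assumption): returns the next token given the prefix."""
    return TARGET[len(prefix)]


def toy_parallel(length: int) -> List[str]:
    """Toy NAR model (assumption): returns all positions in one call."""
    return TARGET[:length]


def ar_decode(step_fn: Callable[[List[str]], str],
              max_len: int) -> Tuple[List[str], int]:
    """Autoregressive decoding: one model call per emitted token."""
    prefix: List[str] = []
    calls = 0
    for _ in range(max_len):
        token = step_fn(prefix)  # each call depends on the previous outputs
        calls += 1
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix, calls


def nar_decode(parallel_fn: Callable[[int], List[str]],
               length: int) -> Tuple[List[str], int]:
    """Non-autoregressive decoding: a single call predicts every position."""
    tokens = parallel_fn(length)
    return [t for t in tokens if t != "<eos>"], 1


if __name__ == "__main__":
    print(ar_decode(toy_step, max_len=10))
    print(nar_decode(toy_parallel, length=len(TARGET)))
```

In this toy both decoders return the same tokens, but the AR loop needs one sequential call per token while the NAR decoder needs one call in total. Real NAR models typically predict each position without conditioning on the other outputs, which is one common explanation for the accuracy drop the abstract describes.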

