CL | LG | AS | Apr 15, 2022

Streaming Align-Refine for Non-autoregressive Deliberation

arXiv:2204.07556v1 | 3 citations | h-index: 69
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, low-latency speech recognition in real-time applications such as voice search. Its contribution is incremental: it adapts the existing offline Align-Refine algorithm to the streaming setting.

The paper proposes a streaming non-autoregressive decoding algorithm that refines the hypothesis alignment of a streaming RNN-T model. Given a reasonable amount of right context, the streaming model is both efficient and low-latency and performs as well as its offline counterpart on voice search datasets and Librispeech. Discriminative training with the MWER criterion yields further WER gains when the first-pass model has small capacity.

We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context, thus enjoying both high efficiency and low latency. These advantages are achieved by converting the offline Align-Refine algorithm to be streaming-compatible, with a novel transformer decoder architecture that performs local self-attentions for both text and audio, and a time-aligned cross-attention at each layer. Furthermore, we perform discriminative training of our model with the minimum word error rate (MWER) criterion, which has not been done in the non-AR decoding literature. Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart, and discriminative training leads to further WER gain when the first-pass model has small capacity.
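Two ideas in the abstract can be illustrated with a minimal NumPy sketch: attention restricted to a local window with limited right context, and an MWER-style loss over a set of hypotheses. This is not the paper's implementation. The function names (`local_attention_mask`, `masked_softmax`, `mwer_loss`), the window parameters, and the n-best form of the MWER loss (posterior-weighted word errors with the mean error subtracted as a baseline) are assumptions made here for illustration; the abstract does not specify window sizes or the exact loss formulation.

```python
import numpy as np


def local_attention_mask(num_frames, left_context, right_context):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Assumption: a simple banded window; the paper's actual windows for text,
    audio, and time-aligned cross-attention are not given in the abstract.
    """
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # offset[i, j] = j - i
    return (offset >= -left_context) & (offset <= right_context)


def masked_softmax(scores, mask):
    """Row-wise softmax over positions allowed by the mask."""
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def mwer_loss(log_probs, word_errors):
    """Expected word errors over a hypothesis set, relative to the mean error.

    Assumption: hypothesis posteriors are renormalized over the set, as in the
    common n-best MWER formulation.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    word_errors = np.asarray(word_errors, dtype=float)
    shifted = np.exp(log_probs - log_probs.max())
    posteriors = shifted / shifted.sum()
    return float(np.sum(posteriors * (word_errors - word_errors.mean())))
```

With `right_context=0` the mask is fully causal; increasing it lets each frame see a bounded number of future frames, which is the latency versus accuracy trade-off the abstract refers to. The MWER-style loss becomes negative when probability mass concentrates on hypotheses with fewer word errors than the set average.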

