CL · AI · LG · NE · AS · Apr 15, 2024

Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

arXiv:2404.10180v2 · 27 citations · h-index: 36 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses latency issues in non-streaming ASR systems for applications requiring real-time transcription of context-specific phrases, representing an incremental improvement over existing attention-based biasing methods.

The paper tackles the latency bottleneck in contextual biasing for speech recognition by moving the lightweight phrase selection pass before context encoding, so that only the selected phrases need to be encoded. This yields a speedup of up to 16.1x and lets biasing scale to 20K phrases with a maximum pre-decoding delay under 33ms. Adding phrase- and wordpiece-level cross-entropy losses also gives up to a 37.5% relative WER reduction over the baseline.

Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.
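
The reordering described in the abstract can be illustrated with a small sketch. This is not the paper's model: `cheap_embed`, `CountingEncoder`, `deferred_top_k`, and the text `query` are hypothetical stand-ins, and the character-count scoring is an assumption chosen only to keep the example self-contained. The sketch shows the control flow alone: a cheap selection pass scores every phrase, and the costly context encoder runs only on the K survivors.

```python
import heapq


def cheap_embed(phrase, dim=8):
    """Hypothetical lightweight embedding: normalized character-bucket counts."""
    vec = [0.0] * dim
    for ch in phrase.lower():
        vec[ord(ch) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class CountingEncoder:
    """Stand-in for an expensive context encoder; counts encoded phrases."""

    def __init__(self):
        self.calls = 0

    def encode(self, phrase):
        self.calls += 1
        # Placeholder for a costly network forward pass (assumption).
        return cheap_embed(phrase, dim=16)


def deferred_top_k(phrases, query, k, encoder):
    """Select the top-k phrases cheaply, then encode only those."""
    q = cheap_embed(query)
    # 1. Lightweight selection pass over the full phrase list.
    top = heapq.nlargest(k, phrases, key=lambda p: dot(cheap_embed(p), q))
    # 2. Expensive context encoding, deferred to the k selected phrases.
    return {p: encoder.encode(p) for p in top}
```

With a list of N phrases, the encoder in this sketch is called K times rather than N times, which is the source of the pre-decoding savings the abstract describes; the actual selection signal and encoder in the paper may differ.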

