SD · CL · AS · May 18, 2023

ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs

arXiv:2305.10649v1 · 21 citations
Originality: Incremental advance
AI Analysis

This work addresses latency reduction for streaming automatic speech recognition systems, presenting an incremental improvement with training-free methods.

The paper tackles the problem of reducing Token Display Time (TDT) in streaming ASR models without accuracy loss, achieving reductions of 350-700ms on First TDT and 100-400ms on Last TDT while maintaining equal WER on Aishell-1 and Librispeech datasets.

In this paper, we present ZeroPrompt (Figure 1(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective training-free methods to decrease the Token Display Time (TDT) of streaming ASR models without any accuracy loss. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts like a prompt that encourages the model to predict future tokens even before they are spoken. We argue that streaming acoustic encoders naturally have the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350 to 700 ms reduction in First Token Display Time (TDT-F) and a 100 to 400 ms reduction in Last Token Display Time (TDT-L), with theoretically and experimentally equal WER on both the Aishell-1 and Librispeech datasets.
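
The core step described above, appending zeroed content to each chunk before it reaches the encoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `append_zero_prompt` and `stream_with_zero_prompt`, the representation of a chunk as a list of feature frames, and the parameters `chunk_size` and `prompt_frames` are all assumptions made for illustration, and the encoder itself and the Prompt-and-Refine step are left out.

```python
from typing import Iterator, List

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def append_zero_prompt(chunk: List[Frame], prompt_frames: int) -> List[Frame]:
    """Return a copy of `chunk` followed by `prompt_frames` all-zero frames.

    The zero frames stand in for audio that has not been spoken yet.
    """
    if prompt_frames < 0:
        raise ValueError("prompt_frames must be non-negative")
    if not chunk:
        return []
    dim = len(chunk[0])
    zeros = [[0.0] * dim for _ in range(prompt_frames)]
    return [list(frame) for frame in chunk] + zeros


def stream_with_zero_prompt(
    features: List[Frame], chunk_size: int, prompt_frames: int
) -> Iterator[List[Frame]]:
    """Split `features` into chunks and append a zero prompt to each one.

    Each yielded chunk is what a streaming encoder would receive under
    this sketch; no model is called here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(features), chunk_size):
        yield append_zero_prompt(features[start:start + chunk_size], prompt_frames)


if __name__ == "__main__":
    # Ten 2-dimensional frames, chunks of 4 frames, 2 zero frames per chunk.
    feats = [[float(i), float(i) + 0.5] for i in range(10)]
    for padded in stream_with_zero_prompt(feats, chunk_size=4, prompt_frames=2):
        print(len(padded), padded[-1])
```

Because the padding is applied only at inference time, a sketch like this leaves the trained model untouched, which matches the training-free framing of the abstract.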

