SDAIAS | Jun 17, 2025

Unifying Streaming and Non-streaming Zipformer-based ASR

arXiv:2506.14434v1 | 2 citations | h-index: 13 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient ASR deployment in production settings, though it is an incremental improvement over existing Zipformer models.

The authors tackle the problem of unifying streaming and non-streaming automatic speech recognition models to reduce costs, achieving a 7.9% relative reduction in word error with a small degradation in user-perceived latency.

There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through chunked attention masking in the training of zipformer-based ASR models. We demonstrate that right-context is more effective in zipformer models than in other conformer models, owing to the zipformer's multi-scale nature. We analyze the effect of varying the number of right-context frames on the accuracy and latency of the streaming ASR models. We use LibriSpeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production-grade server-client setup across diverse test sets from different domains. The proposed strategy reduces word error by a relative 7.9% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customers' requirements.
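
The abstract's chunked attention masking with right-context can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `left_chunks` parameter, and the candidate right-context values are hypothetical, chosen only to show the idea: each frame attends to its own chunk, to earlier chunks, and to a few look-ahead frames beyond the chunk boundary, and the amount of look-ahead is drawn at random during training so one model can serve several latency settings.

```python
import random


def chunked_attention_mask(num_frames, chunk_size, right_context=0, left_chunks=-1):
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    Hypothetical helper: frames are grouped into chunks of `chunk_size`.
    A frame sees its own chunk, up to `left_chunks` earlier chunks
    (-1 means all of them), and `right_context` frames past its chunk.
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        # First visible frame: start of the oldest allowed left chunk.
        start = 0 if left_chunks < 0 else max(0, (chunk - left_chunks) * chunk_size)
        # One past the last visible frame: end of the chunk plus the look-ahead.
        end = min(num_frames, (chunk + 1) * chunk_size + right_context)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_right_context(choices=(0, 1, 2, 4), rng=random):
    """Pick a look-ahead size for one training batch (assumed candidate values)."""
    return rng.choice(choices)


if __name__ == "__main__":
    # 6 frames, chunks of 2, one look-ahead frame.
    for row in chunked_attention_mask(num_frames=6, chunk_size=2, right_context=1):
        print("".join("x" if allowed else "." for allowed in row))
    # xxx...
    # xxx...
    # xxxxx.
    # xxxxx.
    # xxxxxx
    # xxxxxx
```

Setting `right_context=0` gives a purely chunk-wise streaming mask, while a value at least as large as the utterance length gives full (non-streaming) attention. At inference time the look-ahead would be fixed to whatever value matches the latency budget, since every extra right-context frame has to be received before a chunk can be decoded.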

