Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg

NVIDIA

arXiv:2604.19079 | 71.4 | h-index: 18 | Has Code

AI Analysis

For ASR practitioners, this work can reduce development costs by letting a single model serve both offline and streaming use cases, and the paper reports improved streaming accuracy.

The paper addresses the challenge of unifying offline and streaming ASR in a single Transducer model. The proposed consistency regularization method improves streaming accuracy at low latency while preserving offline performance, with open-sourced models.

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
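The chunk-limited attention with right context mentioned in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name `chunk_attention_mask` and the parameters `chunk_size`, `left_chunks`, and `right_context` are hypothetical, and the exact masking rule used in the paper may differ. The sketch assumes that each frame attends to its own chunk, a fixed number of preceding chunks, and a few look-ahead frames past the end of its chunk.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks, right_context):
    """Build a boolean attention mask for chunk-limited self-attention.

    mask[i][j] is True when query frame i may attend to key frame j.
    All parameter names are illustrative assumptions, not the paper's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks < 0 or right_context < 0:
        raise ValueError("left_chunks and right_context must be non-negative")

    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size  # index of the chunk containing frame i
        # First visible frame: start of the earliest allowed left chunk.
        start = max(0, (chunk - left_chunks) * chunk_size)
        # One past the last visible frame: end of own chunk plus look-ahead.
        end = min(num_frames, (chunk + 1) * chunk_size + right_context)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


# Streaming-style mask: 2-frame chunks, 1 left chunk, 1 look-ahead frame.
streaming = chunk_attention_mask(6, chunk_size=2, left_chunks=1, right_context=1)

# Offline-style mask: a single chunk covering the whole utterance.
offline = chunk_attention_mask(4, chunk_size=4, left_chunks=0, right_context=0)
```

Under these assumptions, a chunk size that covers the whole utterance gives full attention, so the same masking routine covers the offline mode, while smaller chunks with a short right context trade accuracy for latency. A unified model of the kind the abstract describes would presumably be trained while switching between such masks.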
