DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR
This work addresses a key bottleneck in low-latency speech recognition for real-time applications: the accuracy lost when a streaming model sees only a limited past context. It is an incremental improvement over existing unified systems.
The paper tackles the performance gap between streaming with a full past context and streaming with a limited past context by adding a dynamic contextual carry-over mechanism to a unified Conformer-based model, achieving a 25.0% relative reduction in word error rate over the SOTA with negligible latency impact.
Conformer-based end-to-end models are now ubiquitous and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, a performance gap remains between streaming with a full past context and streaming with a limited past context. To address this issue, we propose integrating a novel dynamic contextual carry-over mechanism into a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) uses a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative 25.0% in word error rate, with a negligible latency impact due to the additional context embeddings.
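To make the carry-over idea concrete, the sketch below lists, for each chunk, which raw left-context frames it could attend to and which earlier chunks could contribute a carried-over context embedding. It is an illustration under stated assumptions, not the paper's implementation: the function name `carry_over_plan`, its parameters, the requirement that the left context be a whole number of chunks, and the reading of "non-overlapping" as "context embeddings summarize only chunks that lie entirely before the left context" are all hypothetical choices made for this example. How the context embeddings themselves are computed is not covered here.

```python
def carry_over_plan(num_frames, chunk_size, left_context, num_ctx):
    """Describe what each chunk may attend to under the assumed scheme.

    Assumptions (not taken from the paper):
      - left_context is a multiple of chunk_size;
      - each past chunk is summarized by one context embedding;
      - a chunk uses embeddings only from chunks that end before its
        left-context window starts, so the two sources do not overlap.
    """
    if chunk_size <= 0 or left_context < 0 or num_ctx < 0:
        raise ValueError("sizes must be positive and counts non-negative")
    if left_context % chunk_size != 0:
        raise ValueError("this sketch assumes left_context is a multiple of chunk_size")

    left_chunks = left_context // chunk_size
    plan = []
    for idx, start in enumerate(range(0, num_frames, chunk_size)):
        end = min(start + chunk_size, num_frames)
        left_start = max(0, start - left_context)
        # Number of past chunks that lie entirely before the left-context window.
        older = max(0, idx - left_chunks)
        # Keep only the most recent `num_ctx` of those older chunks.
        ctx_chunks = list(range(max(0, older - num_ctx), older))
        plan.append({
            "chunk": (start, end),               # frames of the current chunk
            "left_frames": (left_start, start),  # raw left-context frames
            "ctx_from_chunks": ctx_chunks,       # chunks supplying context embeddings
        })
    return plan


if __name__ == "__main__":
    for entry in carry_over_plan(num_frames=10, chunk_size=2, left_context=2, num_ctx=2):
        print(entry)
```

With these example settings, the last chunk covers frames 8 to 10, sees frames 6 to 8 as raw left context, and receives carried-over embeddings from chunks 1 and 2. Setting `num_ctx=0` reduces the plan to plain limited-left-context streaming, which is the baseline condition the abstract contrasts against.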