ASCL | Mar 31, 2022

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

arXiv:2203.16758v2 | 23 citations
AI Analysis

This work addresses the trade-off between accuracy and latency in streaming ASR, which is crucial for real-time applications like voice assistants, and represents a strong specific gain rather than a foundational advancement.

The paper tackled the latency issue in streaming automatic speech recognition (ASR) by proposing the CUSIDE framework, which uses a simulation module to generate future context without waiting for it, achieving new state-of-the-art results on the AISHELL-1 dataset while drastically reducing latency.

History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.
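The chunking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameters (`chunk_size`, `left_context`, `right_context`), and in particular `repeat_last_frame` are hypothetical. The abstract describes a simulation module trained jointly with the ASR model, whereas the stand-in here only repeats the last observed frame so that the example stays self-contained. What the sketch does show is that each chunk is assembled from frames that have already arrived, so no real future frames are awaited.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Frame = List[float]  # one acoustic feature vector


def repeat_last_frame(chunk: Sequence[Frame], n_future: int) -> List[Frame]:
    """Hypothetical stand-in for a learned simulation module.

    It copies the last frame of the chunk n_future times; a trained
    predictor would generate these frames instead.
    """
    return [list(chunk[-1]) for _ in range(n_future)]


def stream_chunks(
    frames: Sequence[Frame],
    chunk_size: int,
    left_context: int,
    right_context: int,
    simulate: Callable[[Sequence[Frame], int], List[Frame]] = repeat_last_frame,
) -> Iterator[Tuple[List[Frame], List[Frame], List[Frame]]]:
    """Yield (left, chunk, simulated_right) for each chunk of the input.

    Only frames up to the end of the current chunk are read, so emitting
    a chunk never waits for real future frames.
    """
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        chunk = [list(f) for f in frames[start:end]]
        # History frames are real: they have already been received.
        left = [list(f) for f in frames[max(0, start - left_context):start]]
        # Future frames are simulated from the current chunk.
        right = simulate(chunk, right_context)
        yield left, chunk, right


if __name__ == "__main__":
    # Ten toy 1-dimensional frames: [0.0], [1.0], ..., [9.0]
    feats = [[float(t)] for t in range(10)]
    for left, chunk, right in stream_chunks(
        feats, chunk_size=4, left_context=2, right_context=2
    ):
        # An ASR encoder would consume left + chunk + right here.
        print(len(left), len(chunk), len(right))
```

Passing the simulator as a callable keeps the chunking logic separate from whatever model produces the future frames, so a trained predictor could be substituted without changing `stream_chunks`.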

Code Implementations: 1 repo
