AS CL SD | Sep 30, 2024

Mamba for Streaming ASR Combined with Unimodal Aggregation

arXiv:2410.00070v2 | 11 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This work targets efficiency and latency in streaming ASR and offers an incremental improvement over existing approaches.

This paper explores the use of Mamba encoders for streaming Automatic Speech Recognition (ASR) and introduces a lookahead mechanism. It also proposes a streaming-style unimodal aggregation (UMA) method with early termination (ET) to improve token representation and reduce latency. The model achieves competitive ASR performance in accuracy and latency on two Mandarin Chinese datasets.

This paper addresses streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has been shown to match or surpass Transformers on various tasks while benefiting from linear complexity. We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism that leverages a controllable amount of future information. Additionally, we implement a streaming-style unimodal aggregation (UMA) method, which automatically detects token activity and triggers token output in a streaming fashion, while aggregating feature frames to learn better token representations. Based on UMA, we propose an early termination (ET) method to further reduce recognition latency. Experiments on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
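
To make the streaming aggregation idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes that each encoder frame comes with a positive scalar weight, that a token boundary is declared at a valley in the weight sequence (a weight no larger than its predecessor and smaller than its successor), and that the frames of a segment are combined by a weighted average. The names `stream_uma` and `_weighted_mean` are hypothetical, and the early termination step is not modeled.

```python
from typing import Iterable, Iterator, List, Tuple

Frame = List[float]


def _weighted_mean(frames: List[Frame], weights: List[float]) -> Frame:
    """Weighted average of feature frames (uniform if all weights are zero)."""
    total = sum(weights)
    if total == 0:
        weights = [1.0] * len(frames)
        total = float(len(frames))
    dim = len(frames[0])
    return [
        sum(w * f[d] for f, w in zip(frames, weights)) / total
        for d in range(dim)
    ]


def stream_uma(pairs: Iterable[Tuple[Frame, float]]) -> Iterator[Frame]:
    """Yield one aggregated token representation per detected segment.

    Assumption (illustrative only): the last buffered frame is a valley,
    and therefore closes the current segment, when its weight is <= the
    previous weight and < the weight of the newly arrived frame.
    """
    buf_frames: List[Frame] = []
    buf_weights: List[float] = []
    for frame, weight in pairs:
        at_valley = (
            len(buf_weights) >= 2
            and buf_weights[-1] <= buf_weights[-2]
            and buf_weights[-1] < weight
        )
        if at_valley:
            # Emit the finished segment as soon as the valley is confirmed.
            yield _weighted_mean(buf_frames, buf_weights)
            buf_frames, buf_weights = [], []
        buf_frames.append(frame)
        buf_weights.append(weight)
    if buf_frames:
        # Flush the trailing segment at end of stream.
        yield _weighted_mean(buf_frames, buf_weights)


# Toy input: five one-dimensional frames with made-up weights.
frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
weights = [0.1, 0.8, 0.2, 0.9, 0.3]
tokens = list(stream_uma(zip(frames, weights)))
```

Because the function is a generator, each token is emitted as soon as the frame after its valley arrives, so in this sketch the valley test costs one frame of delay.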

