CL · Feb 16, 2025

DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

arXiv:2502.11123v3 · 8 citations · h-index: 14 · Has Code · NLPCC
Originality: Incremental advance
AI Analysis

This work addresses the need for natural and efficient human-machine interaction by enabling duplex and streaming capabilities in speech conversations. It is incremental, however, because it builds on existing Mamba and Transformer approaches.

The paper tackles the problem of real-time speech conversation by proposing DuplexMamba, a Mamba-based model that processes input and generates output simultaneously to support streaming. It achieves performance comparable to Transformer-based models on ASR tasks and voice assistant benchmarks.

Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations. Our code and model are released.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
