DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization
DuplexCascade addresses the challenge of maintaining conversational intelligence while allowing natural, overlapping speech in spoken dialogue systems. It is an incremental improvement over existing cascaded approaches.
The paper enables full-duplex speech-to-speech dialogue in cascaded ASR-LLM-TTS systems by eliminating VAD segmentation and using micro-turn interactions and control tokens to coordinate turn-taking. Among open-source speech-to-speech dialogue systems, it reports state-of-the-art full-duplex turn-taking and strong conversational intelligence on Full-DuplexBench and VoiceBench.
Spoken dialogue systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. VAD-free end-to-end models, on the other hand, support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
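To make the micro-turn idea concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes the ASR stage emits short text chunks and that the LLM answers each chunk with either a "keep listening" or a "start speaking" control token. The token names `<wait>` and `<speak>`, the `MicroTurnDialogue` class, and the `toy_policy` stand-in for the LLM are all hypothetical; the abstract does not specify the actual token set or chunking scheme.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical control tokens; the abstract does not name the actual token set.
WAIT = "<wait>"    # keep listening, produce no speech in this micro-turn
SPEAK = "<speak>"  # begin a spoken response in this micro-turn


@dataclass
class MicroTurnDialogue:
    """Feeds ASR text chunks to a policy, one micro-turn per chunk."""
    policy: Callable[[List[str]], str]
    history: List[str] = field(default_factory=list)

    def step(self, asr_chunk: str) -> str:
        """Run one micro-turn; return text for TTS ('' means stay silent)."""
        self.history.append(f"user: {asr_chunk}")
        output = self.policy(self.history)
        if output.startswith(SPEAK):
            reply = output[len(SPEAK):].strip()
            self.history.append(f"assistant: {reply}")
            return reply
        # Any other output is treated as a decision to keep listening.
        self.history.append(f"assistant: {WAIT}")
        return ""


def toy_policy(history: List[str]) -> str:
    """Stand-in for the LLM: speak once the latest user chunk ends a sentence."""
    last_user = [h for h in history if h.startswith("user: ")][-1]
    if last_user.rstrip().endswith(("?", ".", "!")):
        return f"{SPEAK} Sure, one moment."
    return WAIT


def run_demo() -> List[str]:
    """Stream three chunks of one utterance through the micro-turn loop."""
    dialogue = MicroTurnDialogue(policy=toy_policy)
    chunks = ["can you", "check the", "weather today?"]
    return [dialogue.step(chunk) for chunk in chunks]
```

In this sketch the system stays silent for the first two chunks and responds only on the third, so the turn-taking decision is made by the policy at every chunk instead of by a VAD segmenter. A real system would replace `toy_policy` with a streaming LLM call and pass non-empty replies to a TTS stage.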