Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
This work addresses the challenge of improving semantic reasoning and reducing latency in spoken dialogue systems for real-time human-computer interaction, representing an incremental advancement over existing duplex methods.
The paper tackles the problem of turn-taking and semantic reasoning in streaming full-duplex spoken dialogue systems by proposing SCoT, a Streaming Chain-of-Thought framework that processes user input and generates responses in blocks. The approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions.
Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have a complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
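To make the blockwise target construction concrete, the following is a minimal sketch of how per-block intermediate targets might be derived from word-level alignments. It is not the authors' implementation. The `Word` type, the `<sil>` token string, the 2-second block size, and the assignment rule (a user word belongs to the block in which it ends, a system word to the block in which it starts) are all assumptions made for illustration.

```python
from dataclasses import dataclass

SILENCE = "<sil>"  # hypothetical silence token; the actual token is not given in the source


@dataclass
class Word:
    text: str
    start_ms: int  # alignment start time in milliseconds
    end_ms: int    # alignment end time in milliseconds


def block_targets(user_words, system_words, block_ms, total_ms):
    """Split aligned words into fixed-duration blocks.

    Assumed rule: a user word is attributed to the block in which it ends
    (it has been fully heard by then); a system word is attributed to the
    block in which it starts.
    """
    n_blocks = max(1, -(-total_ms // block_ms))  # ceiling division
    user = [[] for _ in range(n_blocks)]
    system = [[] for _ in range(n_blocks)]

    for w in user_words:
        idx = min(w.end_ms // block_ms, n_blocks - 1)
        user[idx].append(w.text)
    for w in system_words:
        idx = min(w.start_ms // block_ms, n_blocks - 1)
        system[idx].append(w.text)

    # Blocks with no words receive the silence token as their target.
    return [
        {"user": " ".join(u) or SILENCE, "system": " ".join(s) or SILENCE}
        for u, s in zip(user, system)
    ]


user_alignment = [
    Word("what", 100, 400), Word("time", 450, 900),
    Word("is", 1900, 2100), Word("it", 2150, 2400),
]
system_alignment = [
    Word("it", 2600, 2800), Word("is", 2850, 3000), Word("noon", 4100, 4500),
]

targets = block_targets(user_alignment, system_alignment, block_ms=2000, total_ms=5000)
for i, t in enumerate(targets):
    print(i, t)
```

In this toy dialogue the first block contains only user speech, so its system target is the silence token, while the second block holds both the end of the user's question and the start of the system's reply, which is the kind of overlapping interaction the blockwise scheme is meant to allow.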