AS CL | Jul 25, 2025

FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

arXiv:2507.19040v18.612 citations | h-index: 16 | Has Code | INTERSPEECH

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of evaluating natural human-machine interaction, which matters to researchers and developers of spoken dialogue systems, but its contribution is incremental: it focuses on benchmarking rather than on new model development.

The paper tackles the lack of benchmarks for full-duplex spoken dialogue systems by introducing FD-Bench, a pipeline that uses LLMs, TTS, and ASR to evaluate how models handle user interruptions and delays. Applied to three systems with over 40 hours of speech and 1,200 interruptions, it shows that all of them struggle under frequent interruptions and noisy conditions.

Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS's ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.
