ASCL | Jul 25, 2025

FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

arXiv:2507.19040v1 | 10 citations | h-index: 16 | Has Code | INTERSPEECH
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating natural human-machine interaction, which matters to researchers and developers of spoken dialogue systems. The contribution is incremental, because it focuses on benchmarking rather than on new model development.

The paper tackles the lack of benchmarks for full-duplex spoken dialogue systems by introducing FD-Bench, a pipeline that uses LLMs, TTS, and ASR to evaluate how models handle user interruptions and delays. The benchmark is applied to three systems using over 40 hours of speech and 1,200 interruptions, and the results show that all three models struggle with interruptions and noise.

Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, in contrast to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for full-duplex (FD) scenarios, such as evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline that uses LLMs, TTS, and ASR to address this gap. It uses a diverse set of novel metrics to assess an FDSDS's ability to handle user interruptions, manage delays, and remain robust in challenging scenarios. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models still face challenges under frequent disruptions and noisy conditions, such as failing to respond to user interruptions. Demonstrations, data, and code will be released.
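To make the idea of interruption and delay metrics concrete, here is a minimal sketch of how such scores could be computed from timestamps. It is not the FD-Bench implementation: the `Interruption` record, its field names, the `summarize` function, and the 2-second `max_yield_delay` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Interruption:
    """One simulated user interruption (all field names are hypothetical)."""
    user_start: float              # when the user starts talking over the system (seconds)
    system_stop: Optional[float]   # when the system stops its ongoing speech; None if it never yields
    system_reply: Optional[float]  # when the system starts answering; None if it never replies


def summarize(events: Sequence[Interruption], max_yield_delay: float = 2.0) -> dict:
    """Compute an illustrative response rate and mean delays.

    An interruption counts as handled when the system both stops speaking
    within `max_yield_delay` seconds of the user's onset and later replies.
    The threshold is an assumption, not a value taken from the paper.
    """
    if not events:
        raise ValueError("no interruptions to summarize")

    handled = [
        e for e in events
        if e.system_stop is not None
        and e.system_reply is not None
        and 0.0 <= e.system_stop - e.user_start <= max_yield_delay
    ]

    if handled:
        mean_yield = sum(e.system_stop - e.user_start for e in handled) / len(handled)
        mean_reply = sum(e.system_reply - e.user_start for e in handled) / len(handled)
    else:
        # No handled interruptions, so the delays are undefined.
        mean_yield = mean_reply = None

    return {
        "response_rate": len(handled) / len(events),
        "mean_yield_delay": mean_yield,
        "mean_reply_delay": mean_reply,
    }
```

In a pipeline like the one described, timestamps of this kind would presumably come from aligning the generated user audio with the system's output audio; how that alignment is done is outside the scope of this sketch.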

