CL | Sep 26, 2025

FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction

Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, Jingbo Zhu

arXiv:2509.22243v1 | 9.66 citations | h-index: 10 | Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of evaluating real-time spoken dialogue systems, which is relevant to researchers and developers. The contribution appears incremental: it adds a new benchmark on top of existing full-duplex interaction concepts.

The paper tackles the challenge of benchmarking full-duplex speech-to-speech LLMs for natural human-computer interaction by introducing FLEXI, the first benchmark that incorporates model interruption in emergency scenarios, revealing significant gaps between open-source and commercial models in areas like emergency awareness and latency.

Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open-source and commercial models in emergency awareness, turn terminating, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
