CL | May 21, 2025

VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

arXiv:2505.15727v218.817 citations | h-index: 10 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the inadequate benchmarking of speech conversational abilities in AI systems. The advance is incremental, since it builds on existing evaluation frameworks.

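The authors address the lack of realistic evaluations for speech interaction models by proposing VocalBench, a benchmark of 9,400 instances across four dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. Their experiments on 15 mainstream systems show significant variability, with each system exhibiting distinct strengths and weaknesses.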

The rapid advancement of large language models (LLMs) has accelerated the development of multimodal models capable of speech communications. Unlike text interactions, speech conveys diverse information, including acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the quality of their textual responses, overlooking critical aspects of vocal performance. To address this gap, we propose VocalBench, a comprehensive benchmark to assess the speech conversational abilities, comprising 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers a broad range of fundamental skills essential for effective vocal interactions. For the evaluation scheme, we propose several objective evaluation indicators and incorporate an additional LLM-as-a-judge approach to score open-ended questions. Experimental results on 15 mainstream systems reveal significant variability, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech interaction systems.
