CL | Jun 27, 2025

WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild

Linhao Zhang, Jian Zhang, Bokai Lei, Chuhan Wu, Aiwei Liu, Wei Jia, Xiao Zhou

arXiv:2506.21875v3 | 9 citations | h-index: 3

Originality: Incremental advance

AI Analysis

This work targets the user experience of Audio LLMs in real-world applications by giving researchers and developers a speech-specific benchmark. It is an incremental advance: it builds on existing evaluation methods and adds speech-specific features.

The authors address the lack of specialized benchmarks for evaluating end-to-end speech large language models (LLMs) in real-world applications. They introduce WildSpeech-Bench, a comprehensive benchmark that systematically assesses these models in practical speech conversations, and their results reveal significant performance differences across speech scenarios.

Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities for direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we introduce the first comprehensive benchmark designed to systematically evaluate end-to-end speech LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method that uses customized evaluation checklists and prompts to improve the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across speech scenarios. Query-aware evaluation further enables finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
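
The abstract describes query-aware evaluation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each query may carry its own checklist and that an automatic judge receives a prompt built from it. The names `SpokenQuery`, `GENERIC_CRITERIA`, and `build_judge_prompt`, the prompt wording, and the 1-10 scale are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpokenQuery:
    """Hypothetical container: a query transcript plus its own checklist."""
    transcript: str
    checklist: list[str] = field(default_factory=list)


# Assumed fallback criteria for queries that have no customized checklist.
GENERIC_CRITERIA = [
    "The response addresses the user's request.",
    "The response is suitable for being spoken aloud.",
]


def build_judge_prompt(query: SpokenQuery, response_text: str) -> str:
    """Build a judge prompt whose criteria depend on the query itself."""
    # Query-aware step: prefer the per-query checklist over generic criteria.
    criteria = query.checklist or GENERIC_CRITERIA
    numbered = "\n".join(
        f"{i}. {item}" for i, item in enumerate(criteria, start=1)
    )
    return (
        "You are grading a spoken assistant's reply.\n"
        f"User query (transcript): {query.transcript}\n"
        f"Assistant reply (transcript): {response_text}\n"
        "Check the reply against each item, then give a score from 1 to 10.\n"
        f"Checklist:\n{numbered}\n"
    )
```

A query that hinges on a homophone could carry a checklist item about disambiguating it, while a query with no checklist falls back to the generic criteria. How the benchmark actually writes and applies its checklists is specified in the paper, not here.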

