CL, SD, AS | Jan 9, 2025

VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models

Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King

arXiv:2501.04962v417.018 citations | h-index: 12 | Has Code | ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating knowledge understanding in speech-based AI models, which is relevant to researchers and developers. It is incremental, however, because it adapts existing QA benchmarks to a speech format.

The authors address the lack of benchmarks for evaluating knowledge understanding in end-to-end spoken language models (SLMs) by introducing VoxEval, a SpeechQA benchmark that uses pure speech interactions and diverse audio conditions. Their evaluation reveals significant challenges for current SLMs, including sensitivity to audio variations and limited reasoning capabilities.

With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs' knowledge understanding because they cannot support end-to-end speech evaluation or account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs' knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark can guide the advancement of more sophisticated and reliable SLMs. The VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
