CL | May 22, 2025

IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

arXiv:2505.16774v2 | 4 citations | h-index: 8 | IJCNLP-AACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for standardized evaluation of audio-based LLMs, an incremental step for researchers in multimodal AI.

The authors evaluate instruction-following in audio-based large language models, a capability that is underexplored compared with text and vision-language models. They introduce IFEval-Audio, a dataset of 280 audio-instruction-answer triples across six dimensions, and benchmark state-of-the-art models on it.

Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
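As a rough illustration of what checking a structured output could look like, the sketch below pairs each example with a rule-based check and reports a pass rate per dimension. It is not the authors' evaluation code: the `Example` fields, the three checker functions, and the scoring function are all hypothetical names chosen for this sketch, and the abstract does not say how the benchmark actually scores outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One audio-instruction pair plus a format check (hypothetical schema)."""
    audio_path: str                # path to the audio input (assumed field)
    instruction: str               # text instruction given alongside the audio
    dimension: str                 # e.g. "Capitalization", "Length", "List Structure"
    check: Callable[[str], bool]   # returns True if the output follows the instruction


def all_caps(text: str) -> bool:
    """Capitalization-style check: every alphabetic character is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def max_words(limit: int) -> Callable[[str], bool]:
    """Length-style check: at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def bulleted_list(n_items: int) -> Callable[[str], bool]:
    """List-structure-style check: exactly `n_items` lines, each a '-' or '*' bullet."""
    def check(text: str) -> bool:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        return len(lines) == n_items and all(
            ln.startswith(("- ", "* ")) for ln in lines
        )
    return check


def pass_rate_by_dimension(examples: List[Example], outputs: List[str]) -> Dict[str, float]:
    """Fraction of outputs that satisfy their check, grouped by dimension."""
    counts: Dict[str, List[int]] = {}
    for ex, out in zip(examples, outputs):
        passed_total = counts.setdefault(ex.dimension, [0, 0])
        passed_total[0] += int(ex.check(out))
        passed_total[1] += 1
    return {dim: passed / total for dim, (passed, total) in counts.items()}
```

In use, a model's text outputs would be collected in the same order as the examples and passed to `pass_rate_by_dimension`. Dimensions such as Content, which depend on what the audio actually says, would presumably need a comparison against the reference answer rather than a purely structural rule.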

Code Implementations: 1 repo