AI, SD, AS, Oct 21, 2025

The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

arXiv:2510.19055v1, 6 citations, h-index: 2, Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation tools in audio AI to identify weaknesses in multimodal models, though it is incremental as it focuses on benchmarking rather than new methods.

The authors introduced the MUSE benchmark to evaluate music perception and relational reasoning in audio LLMs, revealing significant performance gaps between SOTA models and human experts, with some models performing near chance.

Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen2.5-Omni and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find that Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
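To make "at or near chance" concrete, one common way to check whether a model's accuracy on a forced-choice task differs from guessing is an exact one-sided binomial test. The sketch below is illustrative only and is not the authors' analysis: the function names, the 0.5 chance level, the trial counts, and the 0.05 threshold are all assumptions made for the example, not figures from the paper.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float) -> float:
    """One-sided P(X >= correct) for X ~ Binomial(total, chance)."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def above_chance(correct: int, total: int, chance: float, alpha: float = 0.05) -> bool:
    """True if accuracy is significantly above the chance level at level alpha."""
    return binomial_p_value(correct, total, chance) < alpha


# Hypothetical counts for a two-alternative task (chance = 0.5); not data from the paper.
near_chance = above_chance(correct=52, total=100, chance=0.5)
well_above = above_chance(correct=90, total=100, chance=0.5)
```

Under these assumed counts, 52 correct out of 100 is not distinguishable from guessing, while 90 out of 100 is. The chance level would need to be set per task according to its number of response options.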

