LG · SD · Sep 25, 2025

Investigating Modality Contribution in Audio LLMs for Music

arXiv:2509.20641v1 · 2 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the explainability of Audio LLMs for music analysis, though it is incremental as it applies an existing framework to a new domain.

The paper investigates whether Audio LLMs rely on audio or text for music-related tasks by quantifying modality contributions using MM-SHAP, finding that higher-accuracy models depend more on text but still localize key audio events.

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
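The abstract describes MM-SHAP as a Shapley-value score for the relative contribution of each modality. The following is a minimal sketch of that idea, not the paper's implementation: it computes exact Shapley values for a toy value function and then takes each modality's share of the total absolute attribution, following the general MM-SHAP definition. The names `shapley_values`, `modality_shares`, and `toy_value`, the four-feature layout, and the weights are all hypothetical and chosen for illustration. A real Audio LLM would need an approximate estimator over audio segments and text tokens instead of exact enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            # Standard Shapley weight for a coalition of size k.
            weight = (
                factorial(k) * factorial(n_features - k - 1) / factorial(n_features)
            )
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def modality_shares(phi, audio_idx, text_idx):
    """Share of total absolute attribution per modality (MM-SHAP-style)."""
    audio_mass = sum(abs(phi[i]) for i in audio_idx)
    text_mass = sum(abs(phi[i]) for i in text_idx)
    total = audio_mass + text_mass
    if total == 0:
        raise ValueError("all attributions are zero; shares are undefined")
    return audio_mass / total, text_mass / total


# Hypothetical toy model: features 0-1 stand for audio segments,
# features 2-3 for text tokens. The "model output" is additive.
WEIGHTS = [0.1, -0.1, 0.5, 0.3]


def toy_value(subset):
    """Model output when only the features in `subset` are unmasked."""
    return sum(WEIGHTS[i] for i in subset)


phi = shapley_values(toy_value, n_features=4)
audio_share, text_share = modality_shares(phi, audio_idx=[0, 1], text_idx=[2, 3])
print(f"audio share: {audio_share:.2f}, text share: {text_share:.2f}")
```

Because the toy model is additive, each Shapley value equals its feature's weight, so the audio share is 0.2 and the text share is 0.8. Absolute values are used so that a feature pushing the prediction down still counts as a contribution, which is what makes the score independent of whether the prediction is correct.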
