HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
For researchers evaluating music understanding in Large Audio-Language Models (LALMs), this dataset provides a more rigorous benchmark than those built with existing automated methods, though its small scale limits generalizability.
The paper introduces HumMusQA, a dataset of 320 expert-curated music understanding questions, and benchmarks six LALMs, finding that models struggle with complex music comprehension and rely on uni-modal shortcuts.
Evaluating music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that tests whether models can actually perceive and interpret music, a standard that current data methodologies frequently fail to meet. We introduce HumMusQA, a dataset of 320 hand-written questions curated and validated by experts with musical training, and argue that such focused, manual curation is better suited to probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
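
One way to picture a uni-modal shortcut test is to score the same model twice, once with the audio and once with the question text alone, and compare the two accuracies. The Python sketch below is illustrative only and is not the paper's evaluation code. It assumes a multiple-choice format, which the abstract does not specify, and the names `Question`, `accuracy`, and `shortcut_gap`, along with the model call signature, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Question:
    """Hypothetical record layout; the real dataset schema may differ."""
    text: str
    options: Sequence[str]
    answer_index: int
    audio_path: str


# Assumed interface: a model takes (question text, options, audio path or None)
# and returns the index of the option it selects.
Model = Callable[[str, Sequence[str], Optional[str]], int]


def accuracy(model: Model, questions: Sequence[Question], use_audio: bool = True) -> float:
    """Fraction of questions answered correctly, optionally withholding the audio."""
    if not questions:
        raise ValueError("questions must not be empty")
    correct = 0
    for q in questions:
        audio = q.audio_path if use_audio else None
        if model(q.text, q.options, audio) == q.answer_index:
            correct += 1
    return correct / len(questions)


def shortcut_gap(model: Model, questions: Sequence[Question]) -> Dict[str, float]:
    """Compare accuracy with audio against a text-only run of the same model."""
    with_audio = accuracy(model, questions, use_audio=True)
    text_only = accuracy(model, questions, use_audio=False)
    return {
        "with_audio": with_audio,
        "text_only": text_only,
        # A small gain suggests the questions can be answered without listening.
        "audio_gain": with_audio - text_only,
    }
```

Under these assumptions, a text-only accuracy well above chance level (1 divided by the number of options) would indicate that a model can exploit cues in the question or answer options rather than the music itself.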