SD, AI | Sep 27, 2025

ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following

arXiv:2509.23350v1 | 2 citations | h-index: 3 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating LLMs for symbolic music understanding. The contribution is incremental: it introduces a new benchmark rather than a novel method.

The authors address the underexplored question of how well large language models understand and reason about symbolic music written in text-based ABC notation. They propose ABC-Eval, a benchmark of 1,086 test samples across 10 sub-tasks, and find notable limitations in the capabilities of the seven state-of-the-art models they evaluate.

As large language models continue to develop, the feasibility and significance of text-based symbolic music tasks have become increasingly prominent. While symbolic music has been widely used in generation tasks, LLM capabilities in understanding and reasoning about symbolic music remain largely underexplored. To address this gap, we propose ABC-Eval, the first open-source benchmark dedicated to the understanding and instruction-following capabilities in text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. Such a diverse scope poses substantial challenges to models' ability to handle symbolic music tasks. We evaluated seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models' symbolic music processing capabilities. Furthermore, the consistent performance of individual baselines across different sub-tasks supports the reliability of our benchmark.

