LGSD
Jun 23, 2025

Benchmarking Music Generation Models and Metrics via Human Preference Studies

arXiv:2506.19085v1
18 citations
h-index: 24
ICASSP
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of subjective evaluation in music generation, giving researchers and practitioners a benchmark and an open dataset. Its contribution is incremental in that it focuses on existing models and metrics.

The authors evaluate music generation models by generating 6,000 songs with 12 state-of-the-art models and surveying 2,500 participants to compare human preferences against existing metrics. The result is what they describe as the first ranking of both models and metrics based on human preference.

Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
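To make the evaluation idea concrete, here is a minimal sketch of how pairwise preferences might be turned into a model ranking and compared against an automatic metric. It assumes a simple win-rate aggregation and a Spearman rank correlation; the abstract does not state which aggregation or correlation statistic the authors use, so both are assumptions. The function names, model names, votes, and metric scores are all hypothetical.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Fraction of pairwise comparisons each model won.

    comparisons: iterable of (winner, loser) model-name pairs.
    Win rate is an assumed aggregation, not necessarily the paper's.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def ranks(values):
    """1-based ranks in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: each tuple is one participant vote (winner, loser).
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
# Hypothetical automatic metric scores per model (higher is better).
metric_scores = {"A": 0.9, "B": 0.4, "C": 0.5}

human = win_rates(votes)
models = sorted(human)
rho = spearman([human[m] for m in models], [metric_scores[m] for m in models])
print(f"Spearman correlation between human win rate and metric: {rho:.3f}")
```

A metric whose ranking of models matches the human win-rate ranking yields a correlation near 1, while a metric that ranks models in the opposite order yields a value near -1. For metrics where lower scores are better, the scores would need to be negated before the comparison.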
