SDAIAS · Mar 20, 2025

Aligning Text-to-Music Evaluation with Human Preferences

arXiv:2503.16669v1 · 14 citations · h-index: 27 · Has Code · ISMIR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of robustly evaluating text-to-music systems, a problem that matters to researchers and developers in generative AI and music technology. The contribution is incremental: it improves on existing evaluation methods rather than introducing a new paradigm.

The paper tackles the problem of evaluating text-to-music models by showing that existing metrics like Fréchet Audio Distance are inconsistent and weakly correlated with human preferences, and proposes a new metric, MAUVE Audio Divergence, which achieves an average rank correlation of 0.84 on musical desiderata and 0.62 correlation with human preferences.
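As a rough illustration of what a rank correlation between a metric and human preferences measures, the sketch below computes a Spearman correlation in plain Python. Spearman is an assumption here, since the summary does not say which rank statistic the paper uses. The divergence scores and win rates are invented toy numbers, not results from the paper, and the helper names (`ranks`, `pearson`, `spearman`) are hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Toy data (invented for illustration): one divergence score per system,
# where lower is better, and the fraction of pairwise comparisons each
# system wins with human listeners, where higher is better.
divergence = [0.9, 0.4, 0.7, 0.2, 0.5]
human_win_rate = [0.20, 0.55, 0.60, 0.80, 0.35]

# Negate the divergence so that "better" points the same way for both lists.
rho = spearman([-d for d in divergence], human_win_rate)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 means the metric orders systems the same way listeners do, and a value near 0 means the metric's ordering carries little information about human preference.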

Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
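For readers unfamiliar with reference-based divergence metrics, the following is a minimal sketch of the Fréchet distance that underlies FAD: fit a Gaussian to each set of embeddings and compare the two Gaussians in closed form. It is not the paper's implementation and does not cover MAD. The embeddings here are random vectors standing in for the output of an audio embedding model, and the function name `frechet_distance` and all data are assumptions made for illustration.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x and y are arrays of shape (num_clips, embedding_dim).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # The product of two PSD matrices has real, non-negative eigenvalues,
    # so tr(sqrt(cov_x @ cov_y)) is the sum of their square roots.
    # Clipping guards against tiny negative values from round-off.
    eigenvalues = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)

# Synthetic stand-ins for embeddings (assumption: a real pipeline would
# obtain these from an audio embedding model applied to music clips).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(2000, 8))
close = rng.normal(0.1, 1.0, size=(2000, 8))  # small shift from the reference
far = rng.normal(1.0, 1.0, size=(2000, 8))    # large shift from the reference

d_close = frechet_distance(reference, close)
d_far = frechet_distance(reference, far)
print(f"close: {d_close:.3f}  far: {d_far:.3f}")
```

The score depends on the embedding model and the reference set as well as on the generated audio, which is part of the design space the abstract describes.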

Code Implementations: 1 repo
