SD AI Feb 2

When Noise Lowers The Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

arXiv:2602.02738v1

Originality: Incremental advance

AI Analysis

The paper addresses the need for robust evaluation methods in music LLMs that can distinguish high-quality compositions from "garbage music". It offers a label-free, model-intrinsic framework that could improve training objectives and benchmarks, though the advance is incremental because it builds on existing loss-based metrics.

The paper tackles the problem that standard cross-entropy loss often decreases when music large language models encounter corrupted music, which undermines its use for evaluating output quality. It finds that the shape of the loss curve, particularly a sharp increase for short noise injections, can serve as a proxy for a model's ability to discern musical integrity. Experiments show that the models respond more strongly to local, texture-level disruptions than to global semantic corruption.

The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce noise injection experiments, where controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss reacting positively to these perturbations, specifically a sharp increase ("Peak" area) for short injections, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that Music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.
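
The noise injection idea in the abstract can be illustrated with a small toy sketch. This is not the authors' code and does not use MusicGen: it assumes a hypothetical bigram "model" over a 16-token vocabulary, injects uniform random tokens instead of waveform noise, and defines its own `peak_response` score (mean loss increase over the injected span). All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def inject_noise(tokens, start, length, vocab_size, rng):
    """Copy `tokens` and overwrite `length` positions from `start` with random tokens."""
    noisy = tokens.copy()
    noisy[start:start + length] = rng.integers(0, vocab_size, size=length)
    return noisy

def per_token_loss(tokens, log_probs):
    """Per-position cross-entropy of a toy bigram model: -log p(x[t+1] | x[t])."""
    return -log_probs[tokens[:-1], tokens[1:]]

def peak_response(clean_loss, noisy_loss, start, length):
    """Hypothetical score: mean loss increase at the positions predicting injected tokens."""
    # Loss index i scores the prediction of token i + 1, so the injected
    # tokens start .. start+length-1 map to indices start-1 .. start+length-2.
    lo, hi = start - 1, start + length - 1
    return float(np.mean(noisy_loss[lo:hi] - clean_loss[lo:hi]))

# Toy stand-in for a trained model: strongly expects token (prev + 1) mod V.
V, T = 16, 64
probs = np.full((V, V), 0.1 / (V - 1))
probs[np.arange(V), (np.arange(V) + 1) % V] = 0.9
log_probs = np.log(probs)

rng = np.random.default_rng(0)
tokens = np.arange(T) % V          # a "clean" sequence the toy model predicts well
clean = per_token_loss(tokens, log_probs)

# Loss response for short, medium, and long injections at the same position.
start = 20
profile = {}
for k in (2, 8, 32):
    noisy_tokens = inject_noise(tokens, start, k, V, rng)
    noisy = per_token_loss(noisy_tokens, log_probs)
    profile[k] = peak_response(clean, noisy, start, k)
```

In this toy setup the loss can only rise under noise, so it does not reproduce the paradox of loss decreasing on corrupted input; it only shows how a per-position loss profile might be measured across injection lengths rather than summarized as a single average.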
