QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems
This work addresses how to assess audio generation systems when human perception is subjective and multi-dimensional, a problem relevant to both researchers and practitioners. The contribution appears incremental, since it builds on existing pre-trained models and datasets.
The paper argues that existing regression-based methods for mean opinion score prediction are limited when evaluating audio generation systems. It introduces QAMRO, which is reported to align better with human evaluations across all dimensions and to significantly outperform baseline models.
Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relative nature of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming strong baseline models.
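To make the idea of pairing a regression objective with a margin-based ranking term more concrete, the sketch below shows one way such a loss might be assembled. It is an illustration only and not the formulation used in QAMRO, which this summary does not specify. The function names, the margin rule (margin proportional to the true MOS gap), the quality weight (larger for pairs containing a highly rated sample), the 1 to 5 score range, and the parameters `margin_scale`, `quality_weight`, and `lam` are all assumptions introduced here.

```python
from itertools import combinations


def mse_loss(pred, target):
    """Regression term: mean squared error between predicted and true MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def adaptive_margin_ranking_loss(pred, target, margin_scale=0.5,
                                 quality_weight=1.0, score_range=(1.0, 5.0)):
    """Pairwise hinge loss with a margin that grows with the true MOS gap.

    Assumed form for illustration; not taken from the paper.
    """
    lo, hi = score_range
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(pred)), 2):
        gap = target[i] - target[j]
        if gap == 0:
            continue  # tied pairs carry no ranking information
        sign = 1.0 if gap > 0 else -1.0
        # Adaptive margin (assumption): larger true gaps demand larger predicted gaps.
        margin = margin_scale * abs(gap)
        # Quality-aware weight (assumption): emphasise pairs with a highly rated sample.
        best = max(target[i], target[j])
        weight = 1.0 + quality_weight * (best - lo) / (hi - lo)
        total += weight * max(0.0, margin - sign * (pred[i] - pred[j]))
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def combined_loss(pred, target, lam=1.0):
    """Regression plus ranking, traded off by the hypothetical weight lam."""
    return mse_loss(pred, target) + lam * adaptive_margin_ranking_loss(pred, target)


if __name__ == "__main__":
    true_mos = [1.0, 3.0, 5.0]
    print(combined_loss([1.0, 3.0, 5.0], true_mos))  # predictions match the labels
    print(combined_loss([5.0, 3.0, 1.0], true_mos))  # predictions in reversed order
```

In this sketch a prediction that preserves the true ordering with sufficiently large gaps incurs no ranking penalty, while a reversed ordering is penalised by both terms. That captures the relative nature of perceptual judgments that a pure regression loss ignores.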