SD · AI · LG · Aug 12, 2025

QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

arXiv:2508.08957v1 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work is aimed at researchers and practitioners who assess audio generation systems, where human perception is subjective and multi-dimensional. It appears incremental because it builds on existing pre-trained models and datasets.

The paper addresses a limitation of existing regression-based methods for mean opinion score prediction when evaluating audio generation systems. It introduces QAMRO, which the authors report aligns better with human evaluations across all dimensions and significantly outperforms baseline models.

Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
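The abstract does not give the loss formula, so the following is only a minimal sketch of the general idea it describes: a pairwise ranking term whose margin adapts to the gap between human scores, added to a standard regression term. Every name and parameter here (`adaptive_margin_ranking_loss`, `combined_loss`, `margin_scale`, `mos_max`, `rank_weight`) is hypothetical, and the quality weighting (pairs containing a higher-rated sample count more) is an assumption about what "quality-aware" could mean, not the authors' definition.

```python
from itertools import combinations


def adaptive_margin_ranking_loss(pred, mos, margin_scale=0.5, mos_max=5.0):
    """Pairwise hinge loss whose margin grows with the human-score gap.

    Assumed form, not taken from the paper: for each pair with different
    MOS, the predicted gap must exceed margin_scale * |MOS gap|.
    """
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(mos)), 2):
        gap = mos[i] - mos[j]
        if gap == 0:
            continue  # tied ratings carry no ordering information
        hi, lo = (i, j) if gap > 0 else (j, i)
        margin = margin_scale * abs(gap)        # adaptive margin
        weight = max(mos[i], mos[j]) / mos_max  # assumed quality weighting
        total += weight * max(0.0, margin - (pred[hi] - pred[lo]))
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def combined_loss(pred, mos, rank_weight=1.0):
    """Mean squared error plus the weighted ranking term."""
    mse = sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(mos)
    return mse + rank_weight * adaptive_margin_ranking_loss(pred, mos)
```

With `margin_scale` at or below 1, predictions that match the human scores exactly incur zero ranking loss, while predictions that reverse the human ordering are penalized in proportion to how far apart the human scores are.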

