Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings
This work addresses domain shift in audio quality assessment for generative audio systems. The contribution is incremental, since it builds on existing methods such as BEATs and triplet loss.
The paper tackles the problem of predicting multiple perceptual quality scores for generative audio, such as text-to-speech and text-to-music output, by addressing the domain shift between natural training data and synthetic evaluation data. The reported result is improved embedding discriminability and generalization without synthetic training data.
We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores--Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness--for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
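The abstract does not spell out how the buffer-based triplet sampling works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a FIFO buffer of recent (embedding, score) pairs, picks the positive as the buffered item whose score is closest to the anchor's and the negative as the one whose score is farthest, and applies the standard hinge-style triplet loss with Euclidean distance. The names `TripletBuffer`, `pos_tol`, `neg_gap`, and `margin`, and all numeric values, are hypothetical and chosen for illustration.

```python
import math
from collections import deque


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


class TripletBuffer:
    """Hypothetical FIFO buffer of (embedding, score) pairs from recent batches."""

    def __init__(self, capacity=256):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.items = deque(maxlen=capacity)

    def add(self, embedding, score):
        self.items.append((embedding, score))

    def sample(self, anchor_score, pos_tol=0.5, neg_gap=2.0):
        """Return (positive, negative) embeddings, or None if either is missing.

        Assumed rule (not from the paper): a positive has a score within
        pos_tol of the anchor's score; a negative is at least neg_gap away.
        """
        positives = [(abs(s - anchor_score), e) for e, s in self.items
                     if abs(s - anchor_score) <= pos_tol]
        negatives = [(abs(s - anchor_score), e) for e, s in self.items
                     if abs(s - anchor_score) >= neg_gap]
        if not positives or not negatives:
            return None
        positive = min(positives, key=lambda t: t[0])[1]  # closest score
        negative = max(negatives, key=lambda t: t[0])[1]  # farthest score
        return positive, negative


# Toy usage with made-up 2-D embeddings and made-up quality scores.
buffer = TripletBuffer(capacity=4)
buffer.add([0.9, 0.1], 7.8)
buffer.add([0.1, 0.9], 3.0)
pair = buffer.sample(anchor_score=8.0)
loss = triplet_loss([1.0, 0.0], *pair)
```

In this toy case the anchor already lies much closer to the positive than to the negative, so the hinge is inactive and the loss is zero. In a real system the embeddings would come from the BEATs-plus-LSTM predictor, and the loss would be computed in an autodiff framework so that gradients can reshape the embedding space.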