HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation
This work addresses multidimensional music evaluation, a problem relevant to both researchers and practitioners, but the contribution appears incremental because it builds on existing benchmarks and methods.
The paper tackles the challenge of evaluating song aesthetics by proposing HEAR, a framework that combines multi-scale representations, hierarchical augmentation, and hybrid training. HEAR consistently outperforms the baseline across all metrics on both tracks of the ICASSP 2026 SongEval benchmark.
Evaluating song aesthetics is challenging due to the multidimensional nature of musical perception and the scarcity of labeled data. We propose HEAR, a robust music aesthetic evaluation framework that combines: (1) a multi-source, multi-scale representation module to obtain complementary segment- and track-level features, (2) a hierarchical augmentation strategy to mitigate overfitting, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-tier song identification. Experiments demonstrate that HEAR consistently outperforms the baseline across all metrics on both tracks of the ICASSP 2026 SongEval benchmark. The code and trained model weights are available at https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.
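To make the idea of a hybrid regression-plus-ranking objective concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify which regression or ranking losses HEAR uses, so mean squared error and a pairwise hinge loss are assumed here purely for illustration. The function names and the `alpha` and `margin` parameters are likewise hypothetical.

```python
from itertools import combinations


def mse_loss(pred, target):
    """Regression term: mean squared error between predicted and true scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred, target, margin=0.1):
    """Ranking term (assumed form): hinge loss over all pairs with different labels.

    A pair is penalized when the predicted scores do not separate the two
    songs by at least `margin` in the same order as their true scores.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if target[i] > target[j] else -1.0
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
        count += 1
    return total / count if count else 0.0


def hybrid_loss(pred, target, alpha=0.5, margin=0.1):
    """Weighted sum of the regression and ranking terms (alpha is hypothetical)."""
    return mse_loss(pred, target) + alpha * pairwise_ranking_loss(pred, target, margin)


if __name__ == "__main__":
    true_scores = [1.0, 2.0, 3.0]
    print(hybrid_loss([1.0, 2.0, 3.0], true_scores))  # perfect predictions
    print(hybrid_loss([3.0, 2.0, 1.0], true_scores))  # reversed ordering
```

In this sketch the regression term pushes each predicted score toward its label, while the ranking term only cares about relative order. That is one plausible way to support both accurate scoring and identification of the top-ranked songs.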