SD · May 26

MERIT: Learning Disentangled Music Representations for Audio Similarity

arXiv:2605.27346 · 47.6
Predicted impact: top 60% in SD · last 90 days
Originality: Highly original
AI Analysis

For music information retrieval and recommendation, MERIT provides interpretable and controllable similarity by disentangling core musical dimensions, addressing a key limitation of monolithic similarity models.

MERIT learns disentangled music representations for melody, rhythm, and timbre, enabling factor-specific similarity queries. The model achieves strong disentanglement: each head responds to its intended perceptual factor while remaining near chance on the others, a result validated on both synthetic and real-world audio.
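A factor-specific similarity query can be illustrated with a minimal sketch. It assumes that each clip has already been mapped to one embedding per factor and that similarity is measured with cosine similarity; neither detail is stated in the summary. The names `FACTORS`, `factor_similarity`, and `rank`, the weighting scheme, and the toy embeddings are all hypothetical and are not taken from MERIT.

```python
import math

# Hypothetical factor names, mirroring the three dimensions named in the summary.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def factor_similarity(query, candidate, weights):
    """Weighted average of per-factor cosine similarities.

    `weights` maps a factor name to a non-negative weight; factors that are
    absent from `weights` are ignored entirely (assumed query interface).
    """
    total = sum(weights.values())
    score = sum(w * cosine(query[f], candidate[f]) for f, w in weights.items())
    return score / total


def rank(query, catalog, weights):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, factor_similarity(query, emb, weights))
              for name, emb in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made up for illustration, not model outputs).
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
catalog = {
    "same_melody": {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0]},
    "same_rhythm": {"melody": [0.0, 1.0], "rhythm": [0.0, 1.0], "timbre": [1.0, -1.0]},
}

melody_ranking = rank(query, catalog, {"melody": 1.0})
rhythm_ranking = rank(query, catalog, {"rhythm": 1.0})
```

With these toy vectors, a melody-only query ranks `same_melody` first and a rhythm-only query ranks `same_rhythm` first, which is the kind of control a single monolithic score cannot offer.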

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced queries, for example retrieving tracks with a similar melody but a different timbre. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
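One way to make "near chance on the others" concrete is a triplet test, sketched below. The abstract does not describe the evaluation protocol, so this is an assumed setup rather than the authors' method: in each triplet the positive shares one target factor with the anchor and the negative does not, and a head counts as correct when it places the positive closer to the anchor. Under that setup chance is 0.5. The helper names and toy embeddings are hypothetical.

```python
import math

# Hypothetical head names, one per factor named in the abstract.
HEADS = ("melody", "rhythm", "timbre")


def clip(melody, rhythm, timbre):
    """Bundle the per-head embeddings of one audio clip (toy values)."""
    return {"melody": melody, "rhythm": rhythm, "timbre": timbre}


def triplet_accuracy(triplets, head):
    """Fraction of triplets where `head` puts the positive closer than the negative."""
    correct = sum(
        math.dist(a[head], p[head]) < math.dist(a[head], n[head])
        for a, p, n in triplets
    )
    return correct / len(triplets)


def response_row(triplets):
    """Accuracy of every head on triplets that vary a single target factor."""
    return {head: triplet_accuracy(triplets, head) for head in HEADS}


# Toy triplets targeting melody: the positive shares the anchor's melody,
# the negative does not. Rhythm and timbre vary independently of that label.
melody_triplets = [
    (clip([1, 0], [1, 0], [1, 0]),   # anchor
     clip([1, 0], [1, 0], [0, 1]),   # positive: same melody
     clip([0, 1], [0, 1], [1, 0])),  # negative: different melody
    (clip([0, 1], [1, 0], [1, 0]),
     clip([0, 1], [0, 1], [1, 0]),
     clip([1, 0], [1, 0], [0, 1])),
]

row = response_row(melody_triplets)
```

Repeating this for rhythm-targeted and timbre-targeted triplets would give a 3×3 matrix; under this assumed protocol, the behaviour the abstract describes corresponds to a high diagonal with off-diagonal entries near 0.5.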

