SD · LG · Apr 11

Masked Contrastive Pre-Training Improves Music Audio Key Detection

arXiv:2604.10021 · 14.1 · h-index: 1
Predicted impact: top 71% in SD · last 90 days
Originality: Incremental advance
AI Analysis

For music information retrieval researchers, this work demonstrates that self-supervised pretraining can effectively address pitch-sensitive tasks like key detection, offering a simpler alternative to supervised methods.

The paper shows that masked contrastive pretraining on Mel spectrograms produces pitch-sensitive representations, enabling state-of-the-art key detection with simple MLPs, outperforming prior methods without complex augmentations.

Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and we demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) key detection performance in the supervised setting. First, we find that linear evaluation after masking-based contrastive pretraining on Mel spectrograms yields competitive performance on music key detection out of the box. Motivated by this result, we train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, achieving SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
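
The probing setup described in the abstract (a shallow but wide MLP on top of frozen pretrained features) can be illustrated with a minimal sketch. This is not the authors' implementation: the 24-class key vocabulary, the embedding and hidden sizes, the random weights, and the names `init_layer`, `linear`, `softmax`, and `mlp_probe` are all assumptions made for illustration, and the embedding is a random stand-in for real features.

```python
import math
import random

# Assumption: 24 key classes (12 tonics x major/minor); the paper's label set may differ.
TONICS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
KEYS = [f"{tonic} {mode}" for mode in ("major", "minor") for tonic in TONICS]


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small uniform random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_probe(embedding, hidden_layer, output_layer):
    """Shallow MLP: one wide ReLU hidden layer, then key-class probabilities."""
    hidden = [max(0.0, h) for h in linear(embedding, hidden_layer)]
    return softmax(linear(hidden, output_layer))


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 16, 256  # hypothetical sizes, not taken from the paper
hidden_layer = init_layer(EMBED_DIM, HIDDEN_DIM, rng)
output_layer = init_layer(HIDDEN_DIM, len(KEYS), rng)

# Random stand-in for the frozen base model's embedding of one audio clip.
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
probs = mlp_probe(embedding, hidden_layer, output_layer)
predicted_key = KEYS[max(range(len(KEYS)), key=probs.__getitem__)]
```

In a real probe only the two layers would be trained (for example with cross-entropy on labeled keys) while the pretrained encoder stays frozen; dropping the hidden layer reduces this to the linear evaluation the abstract mentions first.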
