SD | IR | LG | MM | AS | Jan 17, 2024

On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations

arXiv:2401.08889v1 | 10 citations | h-index: 2 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the design of audio embeddings for music search and recommendation, focusing on the local properties that matter for nearest neighbor algorithms. It is rated as incremental because it builds on existing contrastive learning methods.

The study examines local embedding properties in the contrastive learning of music audio representations. It shows that data augmentation can reduce the locality of properties that are homogeneous within a track, such as key and tempo, while increasing the locality of more salient features such as genre and mood, achieving state-of-the-art nearest neighbor retrieval accuracy.

Audio embeddings are crucial tools in understanding large catalogs of music. Typically, embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks; however, few studies have investigated the local properties of the embedding spaces themselves, which are important in nearest neighbor algorithms commonly used in music search and recommendation. In this work we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localisation of such properties can be reduced while the localisation of other attributes is increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.
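
To make the notion of "locality" concrete, here is a minimal sketch of one way it could be measured: the average fraction of each item's k nearest neighbours that share its label (for example genre, or key). The function name `knn_label_agreement`, the use of cosine similarity, and the default `k` are assumptions made for illustration; the abstract does not specify the metric the authors use.

```python
import numpy as np


def knn_label_agreement(embeddings, labels, k=5):
    """Mean fraction of each item's k nearest neighbours that share its label.

    Hypothetical helper for illustration only; not taken from the paper.
    Neighbours are ranked by cosine similarity and an item is never
    counted as its own neighbour.
    """
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # L2-normalise rows so that the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    # Exclude each item from its own neighbourhood.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar items for every row.
    nn = np.argsort(-sim, axis=1)[:, :k]
    # Compare each neighbour's label with the query item's label.
    return float((y[nn] == y[:, None]).mean())
```

Under these assumptions, a higher score for genre labels together with a lower score for key labels would correspond to the shift in locality described above. Computing the score for embeddings trained with different augmentation strategies gives a simple way to compare them.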

