MM · Apr 30, 2021

Cross-Modal Music-Video Recommendation: A Study of Design Choices

arXiv:2104.14799v1 · 25 citations
AI Analysis

This is an incremental improvement for music and video recommendation systems, focusing on design choices in a specific domain.

The study tackled music-video cross-modal recommendation by building on an existing system and testing audio representation learning and loss functions, finding that learned audio embeddings improved recommendations over handcrafted features.

In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn audio and video embeddings by using their co-occurrence in music-video clips. In this work, we build upon a recent video-music retrieval system (the VM-NET), which originally relies on an audio representation obtained by a set of statistics computed over handcrafted features. We demonstrate here that using audio representation learning such as the audio embeddings provided by the pre-trained MuSimNet, OpenL3, MusicCNN or by AudioSet, largely improves recommendations. We also validate the use of the cross-modal triplet loss originally proposed in the VM-NET compared to the binary cross-entropy loss commonly used in self-supervised learning. We perform all our experiments using the Music Video Dataset (MVD).
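To make the cross-modal triplet idea concrete, here is a minimal sketch in plain Python. It is not the VM-NET formulation, whose exact terms are not given in this abstract. It assumes a generic hinge-based triplet loss over cosine similarity, with every other clip in the batch used as a negative in both directions (video to audio and audio to video). The function names and the `margin` value of 0.2 are illustrative choices, not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def cross_modal_triplet_loss(video_embs, audio_embs, margin=0.2):
    """Mean hinge loss over in-batch negatives, in both retrieval directions.

    Assumption: video_embs[i] and audio_embs[i] come from the same
    music-video clip (the positive pair); every other index is a negative.
    """
    n = len(video_embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = cosine(video_embs[i], audio_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Video anchor: the matching audio should beat a mismatched audio.
            neg_audio = cosine(video_embs[i], audio_embs[j])
            total += max(0.0, margin - pos + neg_audio)
            # Audio anchor: the matching video should beat a mismatched video.
            neg_video = cosine(audio_embs[i], video_embs[j])
            total += max(0.0, margin - pos + neg_video)
            count += 2
    return total / count if count else 0.0


# Toy batch: two clips with perfectly aligned embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = cross_modal_triplet_loss(video, audio_aligned)
loss_swapped = cross_modal_triplet_loss(video, audio_swapped)
```

On this toy batch the aligned pairing gives a loss of 0.0, while swapping the audio tracks gives a loss of about 1.2, since each mismatched pair then scores higher than the true pair by the full similarity range plus the margin. A binary cross-entropy alternative would instead classify each (video, audio) pair as matched or mismatched, without directly comparing a positive against a negative.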

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
