SD LG AS, May 31, 2023

Learning Music Sequence Representation from Text Supervision

arXiv:2305.19602v1, 15 citations
Originality: Highly original
AI Analysis

This paper addresses the challenge of learning music representations for downstream tasks, offering a more flexible and data-efficient approach than existing pre-training methods.

The paper tackles music sequence representation learning by proposing MUSER, a text-supervision pre-training method built on an audio-spectrum-text tri-modal contrastive learning framework. It reports state-of-the-art performance while requiring only 0.056% of the pre-training data needed by the current data-hungry pre-training method.

Music representation learning is notoriously difficult because of the complex human-related concepts contained in the sequence of numerical signals. To learn a better MUsic SEquence Representation from labeled audio, we propose a novel text-supervision pre-training method, namely MUSER. MUSER adopts an audio-spectrum-text tri-modal contrastive learning framework, in which the text input can be any form of metadata, converted to text with the help of text templates, while the spectrum is derived from the audio sequence. Our experiments reveal that MUSER can be adapted to downstream tasks more flexibly than the current data-hungry pre-training method, and that it requires only 0.056% of the pre-training data to achieve state-of-the-art performance.
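The two ideas in the abstract, turning metadata into text through a template and aligning audio, spectrum, and text embeddings contrastively, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the template wording, the function names (`fill_template`, `info_nce`, `tri_modal_loss`), the CLIP-style symmetric InfoNCE loss, the temperature value, and the choice to average over all three modality pairs are not taken from the paper, which may define its objective differently.

```python
import math


def fill_template(metadata, template="A {genre} track played on {instrument}."):
    """Turn a metadata record into a text prompt (hypothetical template)."""
    return template.format(**metadata)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, targets, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match targets[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE losses (CLIP-style)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def tri_modal_loss(audio, spectrum, text, temperature=0.07):
    """Assumed combination: mean symmetric loss over the three modality pairs."""
    pairs = [(audio, text), (spectrum, text), (audio, spectrum)]
    losses = [symmetric_contrastive(a, b, temperature) for a, b in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(fill_template({"genre": "jazz", "instrument": "piano"}))
    # Toy 2-D embeddings standing in for encoder outputs.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    spectrum = [[1.0, 0.0], [0.0, 1.0]]
    text_aligned = [[1.0, 0.0], [0.0, 1.0]]
    text_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(tri_modal_loss(audio, spectrum, text_aligned))
    print(tri_modal_loss(audio, spectrum, text_swapped))
```

In this toy setup the loss is small when each text embedding points in the same direction as its paired audio and spectrum embeddings, and grows when the text embeddings are swapped across items, which is the behaviour a contrastive objective is meant to reward and penalise respectively.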
