Mar 17, 2021

Self-Supervised Learning of Audio Representations from Permutations with Differentiable Ranking

arXiv:2103.09879v1
29 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of learning audio representations without labels. The advance is incremental: it builds on existing permutation-based pretext tasks and extends them so that the full set of permutations can be exploited.

The paper tackles self-supervised learning for audio by pre-training a model to reorder shuffled spectrogram patches using differentiable ranking. The pre-trained representations improve downstream classification, with gains in instrument classification and pitch estimation.

Self-supervised pre-training using so-called "pretext" tasks has recently shown impressive performance across a wide range of modalities. In this work, we advance self-supervised learning from permutations, by pre-training a model to reorder shuffled parts of the spectrogram of an audio signal, to improve downstream classification performance. We make two main contributions. First, we overcome the main challenges of integrating permutation inversions into an end-to-end training scheme, using recent advances in differentiable ranking. This was heretofore sidestepped by casting the reordering task as classification, fundamentally reducing the space of permutations that can be exploited. Our experiments validate that learning from all possible permutations improves the quality of the pre-trained representations over using a limited, fixed set. Second, we show that inverting permutations is a meaningful pretext task for learning audio representations in an unsupervised fashion. In particular, we improve instrument classification and pitch estimation of musical notes by reordering spectrogram patches in the time-frequency space.
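The pretext task described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`make_permutation_example`, `soft_rank`, `ranking_loss`) are hypothetical, patches are cut along the time axis only, and a simple pairwise-sigmoid relaxation stands in for the differentiable ranking operator used in the paper. The point is only to show how a randomly drawn permutation can serve directly as the regression target, rather than being picked from a fixed set of classes.

```python
import numpy as np


def make_permutation_example(spectrogram, num_patches, rng):
    """Cut a (freq, time) spectrogram into equal time patches and shuffle them.

    Returns the shuffled patches and the permutation used, where
    shuffled[k] is the patch that originally sat at position perm[k].
    """
    _, n_time = spectrogram.shape
    patch_len = n_time // num_patches  # trailing frames are dropped if not divisible
    patches = [
        spectrogram[:, i * patch_len:(i + 1) * patch_len]
        for i in range(num_patches)
    ]
    perm = rng.permutation(num_patches)  # any permutation can be drawn
    shuffled = np.stack([patches[j] for j in perm])
    return shuffled, perm


def soft_rank(scores, temperature=0.1):
    """Smooth 0-based ranks via pairwise sigmoids.

    ASSUMPTION: this relaxation is a stand-in for illustration, not the
    differentiable ranking operator used in the paper.
    """
    diff = (scores[:, None] - scores[None, :]) / temperature
    sig = 0.5 * (1.0 + np.tanh(0.5 * diff))  # numerically stable sigmoid
    # Each row counts (softly) how many scores are smaller; the
    # self-comparison contributes sigmoid(0) = 0.5, so subtract it.
    return sig.sum(axis=1) - 0.5


def ranking_loss(scores, perm, temperature=0.1):
    """Mean squared error between soft ranks of the scores and the true positions."""
    return float(np.mean((soft_rank(scores, temperature) - perm) ** 2))


rng = np.random.default_rng(0)
spec = rng.random((64, 120))          # placeholder spectrogram, not real audio
shuffled, perm = make_permutation_example(spec, num_patches=5, rng=rng)
scores = rng.normal(size=5)           # stands in for a model's per-patch outputs
print(shuffled.shape, perm, ranking_loss(scores, perm))
```

In a real training loop the scores would come from a network applied to the shuffled patches, and the loss would be back-propagated through the ranking relaxation in an autodiff framework. Lowering `temperature` makes the soft ranks approach hard ranks at the cost of smaller gradients.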
