AS · LG · SD · Mar 11, 2021

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

arXiv:2103.06695v2 · 207 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient audio representation learning for applications in audio processing, though it is incremental as it adapts an existing vision method to audio.

The paper tackles the problem of learning general-purpose audio representations without relying on relationships between different time segments. It proposes BYOL-A, a self-supervised method based on BYOL that creates contrasts between two augmented views of a single audio segment, and reports state-of-the-art results on various downstream tasks through a combination of normalization and augmentation techniques.

Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL for learning general-purpose audio representation. Unlike most previous audio self-supervised learning methods that rely on agreement of vicinity audio segments or disagreement of remote ones, BYOL-A creates contrasts in an augmented audio segment pair derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks. Extensive ablation studies also clarified the contribution of each component and their combinations.
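The core idea in the abstract, two augmented views derived from one audio segment and trained to agree, can be illustrated with a minimal sketch. It uses only the Python standard library and makes several assumptions that are not taken from the paper: `augment` is a hypothetical stand-in (random gain plus a circular time shift) rather than the paper's augmentations, the encoder and predictor are toy linear maps, and `tau` is an arbitrary value. The loss and the moving-average target update follow the general BYOL recipe.

```python
import math
import random


def augment(segment, rng):
    """Hypothetical stand-in augmentation: random gain plus circular time shift."""
    gain = rng.uniform(0.5, 1.5)
    shift = rng.randrange(len(segment))
    shifted = segment[shift:] + segment[:shift]
    return [gain * x for x in shifted]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def linear(weights, vector):
    """Toy linear layer: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def byol_loss(online_prediction, target_projection):
    """Squared distance between L2-normalized vectors, i.e. 2 - 2 * cosine."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights toward online weights by exponential moving average."""
    return [
        [tau * t + (1.0 - tau) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_weights, online_weights)
    ]


rng = random.Random(0)

# One audio segment (a toy sine wave) yields BOTH views; no second segment is used.
segment = [math.sin(0.3 * t) for t in range(16)]
view_a, view_b = augment(segment, rng), augment(segment, rng)

# Toy online encoder and predictor; the target encoder starts as a copy.
online_w = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
predictor_w = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
target_w = [row[:] for row in online_w]

# Symmetrized loss: each view passes through the online branch once.
loss = (
    byol_loss(linear(predictor_w, linear(online_w, view_a)), linear(target_w, view_b))
    + byol_loss(linear(predictor_w, linear(online_w, view_b)), linear(target_w, view_a))
)

# A gradient step on the online weights would go here; the target then follows by EMA.
target_w = ema_update(target_w, online_w)
print(f"symmetrized loss on one segment: {loss:.4f}")
```

In an actual training setup the inputs would be spectrogram-like features and the networks would be trained by gradient descent with no gradient flowing through the target branch; the sketch only shows how a single segment supplies both sides of the comparison.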

Code Implementations: 3 repos
