CV, LG, SD · Oct 27, 2016

SoundNet: Learning Sound Representations from Unlabeled Video

arXiv:1610.09001v1 · 1105 citations
Originality: Highly original
AI Analysis

Labeled sound data is expensive to acquire for audio recognition tasks. This work sidesteps that cost by learning from unlabeled video, an approach that scales well for researchers and practitioners in audio processing.

The authors tackle the problem of learning sound representations without labeled data. They train on two million unlabeled videos with a student-teacher procedure and report significant improvements over the state of the art on acoustic scene/object classification benchmarks.

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
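
The student-teacher transfer described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the teacher's output for each video is a class-probability vector and that the student is trained by minimizing the KL divergence from the teacher's distribution to its own. The linear student, the random "audio features", the synthetic teacher posteriors, and all function names are hypothetical stand-ins chosen so the example runs with NumPy alone.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """Mean KL(teacher || student) over the batch."""
    per_example = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=1,
    )
    return float(np.mean(per_example))

def train_student(audio_feats, teacher_probs, steps=200, lr=0.5):
    """Fit a linear softmax student to the teacher's soft labels.

    Assumption: a linear model stands in for the sound network; no
    ground-truth labels are used, only the teacher's posteriors.
    """
    n, d = audio_feats.shape
    k = teacher_probs.shape[1]
    W = np.zeros((d, k))
    losses = []
    for _ in range(steps):
        student_probs = softmax(audio_feats @ W)
        losses.append(kl_divergence(teacher_probs, student_probs))
        # Gradient of KL(teacher || softmax(xW)) w.r.t. W is x^T (student - teacher) / n.
        grad = audio_feats.T @ (student_probs - teacher_probs) / n
        W -= lr * grad
    return W, losses

# Synthetic stand-ins: 64 "videos", 8-dim audio features, 5 visual categories.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(64, 8))
# Hypothetical teacher posteriors; in practice these would come from a
# pretrained visual recognition model applied to the video frames.
teacher_probs = softmax(audio_feats @ rng.normal(size=(8, 5)))

W, losses = train_student(audio_feats, teacher_probs)
print(f"KL before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

The video acts only as a bridge here: the visual model labels each clip softly, and the sound model learns to reproduce those soft labels from the audio alone.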

Code Implementations: 6 repos
