CV SD AS IV, Mar 25, 2021

Weakly-supervised Audio-visual Sound Source Detection and Separation

arXiv:2104.02606v1, 8 citations
Originality: Incremental advance
AI Analysis

This paper addresses audio-visual sound source detection and separation for applications such as video analysis. Its contribution is incremental, since it builds on existing weakly-supervised frameworks.

The paper tackles the problem of localizing and separating individual object sounds in videos using only object labels as supervision, proposing an audio-visual co-segmentation method that outperforms state-of-the-art approaches on the MUSIC dataset for sound source separation and denoising.

Learning how to localize and separate individual object sounds in the audio channel of a video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, an approach known as the Mix-and-Separate framework. We propose an audio-visual co-segmentation approach, in which the network learns both what individual objects look like and what they sound like, from videos labeled with only object labels. Unlike other recent visually-guided audio source separation frameworks, our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals. Specifically, we introduce weakly-supervised object segmentation in the context of sound separation. We also formulate spectrogram mask prediction using a set of learned mask bases, which are combined using coefficients conditioned on the output of object segmentation, a design that facilitates separation. Extensive experiments on the MUSIC dataset show that our proposed approach outperforms state-of-the-art methods on visually guided sound source separation and sound denoising.
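
To make the mask-basis idea concrete, here is a minimal NumPy sketch of predicting a spectrogram mask as a weighted combination of mask bases. It is an illustration under stated assumptions, not the paper's implementation: the abstract says only that learned bases are combined using coefficients conditioned on the object segmentation output. The linear combination, the sigmoid squashing, the function names `predict_mask` and `separate`, and the random stand-ins for the learned bases and the coefficients are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Squash logits into (0, 1) so the result can act as a soft mask."""
    return 1.0 / (1.0 + np.exp(-x))


def predict_mask(bases, coeffs):
    """Combine K mask bases into one soft spectrogram mask.

    bases:  array of shape (K, F, T), one basis per row (assumed learned).
    coeffs: array of shape (K,), assumed to come from the visual branch
            (here: a stand-in for features derived from object segmentation).
    Returns an (F, T) mask with values in (0, 1).
    """
    # Weighted sum over the K bases: contracts the first axis of `bases`.
    logits = np.tensordot(coeffs, bases, axes=1)
    return sigmoid(logits)


def separate(mixture_spec, bases, coeffs):
    """Apply the predicted mask to a magnitude spectrogram of the mixture."""
    return predict_mask(bases, coeffs) * mixture_spec


# Toy shapes: K bases over F frequency bins and T time frames.
rng = np.random.default_rng(0)
K, F, T = 4, 8, 6
bases = rng.standard_normal((K, F, T))         # placeholder for learned bases
coeffs = rng.standard_normal(K)                # placeholder for visual coefficients
mixture = np.abs(rng.standard_normal((F, T)))  # placeholder magnitude spectrogram

mask = predict_mask(bases, coeffs)
separated = separate(mixture, bases, coeffs)
```

In a Mix-and-Separate training setup, `mixture` would be the spectrogram of two or more artificially mixed clips, and the separated output for each object would be compared against that object's original clip.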

