CV, MM, SD, AS (Mar 30, 2022)

The Sound of Bounding-Boxes

arXiv:2203.15991v1
Originality: Incremental advance
AI Analysis

This work addresses a limitation of existing methods, namely their dependence on pre-trained object detectors. Because the proposed method needs no such detector, it applies to arbitrary object categories without additional annotation. The approach itself is an incremental advance.

The paper tackles the problem of audio-visual sound source separation by proposing a fully unsupervised method that simultaneously learns to detect objects in images and separate sound sources, eliminating reliance on pre-trained object detectors and achieving comparable separation accuracy.

Audio-visual sound source separation uses visual information to separate sound sources, and identifying the objects in an image is a crucial step before the separation itself. Existing methods that assign sounds to detected bounding boxes, however, rely heavily on pre-trained object detectors. Specifically, they require predetermining every category of object that can produce sound and using an object detector that covers all of those categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound sources simultaneously. Because our method does not rely on any pre-trained detector, it is applicable to arbitrary categories without any additional annotation. Furthermore, despite being fully unsupervised, we found that our method performs comparably to existing methods in separation accuracy.
