CV
Nov 28, 2022

Mix and Localize: Localizing Sound Sources in Mixtures

arXiv:2211.15058v1
68 citations
h-index: 26
Originality: Incremental advance
AI Analysis

This paper addresses audio-visual sound source localization, with applications such as robotics and multimedia analysis. The advance is incremental because it builds on existing contrastive random walk methods.

The paper tackles the problem of localizing multiple sound sources within a visual scene by jointly grouping a sound mixture into sources and associating those sources with visual signals. In experiments with musical instruments and human speech, its model outperforms other self-supervised methods.

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
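The random-walk idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the audio and visual embeddings are already computed, and the function names (`softmax`, `cycle_walk_loss`) and the temperature value are hypothetical choices made for illustration. The sketch builds transition probabilities from a cosine similarity matrix, walks from each audio node to the visual nodes and back, and penalizes walks that do not return to their starting node.

```python
import numpy as np


def softmax(x, axis=-1, temperature=0.07):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    z = x / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cycle_walk_loss(audio_emb, visual_emb, temperature=0.07):
    """Return (loss, return_probabilities) for an audio -> visual -> audio walk.

    audio_emb:  (n_audio, d) array, one row per separated sound (assumed given).
    visual_emb: (n_visual, d) array, one row per image node (assumed given).
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    sim = a @ v.T  # (n_audio, n_visual) audio-visual similarity

    # Row-stochastic transition matrices in each direction.
    p_audio_to_visual = softmax(sim, axis=1, temperature=temperature)
    p_visual_to_audio = softmax(sim.T, axis=1, temperature=temperature)

    # Two-step walk: probability of starting at audio node i and ending at j.
    p_return = p_audio_to_visual @ p_visual_to_audio  # (n_audio, n_audio)

    # Cycle-consistency objective: each walker should return to its start node.
    loss = -np.mean(np.log(np.diag(p_return) + 1e-12))
    return loss, p_return


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(3, 16))
    visual = audio + 0.05 * rng.normal(size=(3, 16))  # roughly aligned pairs
    loss, p_return = cycle_walk_loss(audio, visual)
    print("loss:", loss)
    print("return probabilities:", np.diag(p_return))
```

In this toy setup, well-aligned audio and visual embeddings give return probabilities near 1 and a loss near 0, while uninformative embeddings give a uniform walk and a loss of log(n) for n audio nodes. The abstract describes the similarity metric as learned; here the embeddings are fixed inputs, so the sketch only shows how the objective is evaluated.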

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
