Mix and Localize: Localizing Sound Sources in Mixtures
This work addresses audio-visual source localization, which is relevant to applications such as robotics and multimedia analysis, but the contribution is incremental because it builds on existing contrastive random walk methods.
The paper tackles the problem of localizing multiple sound sources within a visual scene by jointly grouping a sound mixture into individual sources and associating those sources with visual signals. It reports that the model outperforms other self-supervised methods in experiments with musical instruments and human speech.
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate them with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
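The cross-modal walk described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-similarity choice, the temperature value, and the audio-to-visual-to-audio round trip are all assumptions made for illustration. The sketch assumes one embedding per separated sound and one per visual node, turns their similarities into transition probabilities with a softmax, and measures how likely a walker is to return to the node it started from.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def return_probabilities(audio_emb, visual_emb, temperature=0.07):
    """Probability that a walk audio -> visual -> audio ends where it began.

    audio_emb:  (k, d) array, one row per separated sound (assumed input).
    visual_emb: (m, d) array, one row per visual node (assumed input).
    temperature: hypothetical softmax temperature, not taken from the paper.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    sim = a @ v.T / temperature          # (k, m) audio-visual similarities
    p_av = softmax(sim, axis=1)          # transitions from audio to visual
    p_va = softmax(sim.T, axis=1)        # transitions from visual to audio
    round_trip = p_av @ p_va             # (k, k) two-step walk
    return np.diag(round_trip)           # chance of returning to the start


def cycle_loss(audio_emb, visual_emb, temperature=0.07):
    # Negative log-likelihood of returning to the starting node.
    probs = return_probabilities(audio_emb, visual_emb, temperature)
    return float(-np.mean(np.log(probs + 1e-12)))


if __name__ == "__main__":
    aligned = np.eye(3)                  # each sound matches one visual node
    print(return_probabilities(aligned, aligned))
    print(cycle_loss(aligned, aligned))
```

Minimizing a loss of this kind rewards similarity scores under which each separated sound has a distinct visual counterpart, which is the intuition behind training for high return probability. When every embedding is identical, the walker returns to its start only by chance, with probability 1/k.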