CV · SD · Apr 21, 2025

Improving Sound Source Localization with Joint Slot Attention on Image and Audio

arXiv:2504.15118v2 · 2 citations · h-index: 4 · CVPR
Originality: Incremental advance
AI Analysis

This work improves sound source localization for applications such as robotics and surveillance, but the advance is incremental because it builds on existing contrastive learning methods.

The paper tackles sound source localization by addressing the suboptimal embeddings caused by noise and irrelevant background in the input. It achieves the best results in almost all settings on three public benchmarks and substantially outperforms prior work in cross-modal retrieval.

Sound source localization (SSL) is the task of locating the source of a sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and an audio clip as a single embedding vector each, and to use these embeddings to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal because the input contains noise and background irrelevant to the actual target. We present a novel SSL method that addresses this chronic issue through joint slot attention on image and audio. Specifically, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. We also introduce cross-modal attention matching to further align the local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
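The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a simplified slot attention update (no learned projections, GRU, or MLP), random features in place of real encoder outputs, the same initial slots for both modalities, and slot 0 as the "target" slot. The names `two_slot_attention` and `info_nce` are hypothetical, and cross-modal attention matching is not covered.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def two_slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned parameters).

    features: (N, D) local features of one image or one audio clip.
    slots:    (2, D) initial slots; slot 0 is treated as the target slot.
    Returns the updated slots (2, D) and the attention map (2, N).
    """
    d = features.shape[1]
    for _ in range(n_iters):
        logits = slots @ features.T / np.sqrt(d)   # (2, N)
        # Softmax over the slot axis: the two slots compete for each feature.
        attn = softmax(logits, axis=0)
        # Normalize over features so each slot becomes a weighted mean.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features                 # (2, D)
    return slots, attn

def info_nce(img_emb, aud_emb, temperature=0.07):
    """Image-to-audio InfoNCE loss; matching pairs share a batch index."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    aud = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    logits = img @ aud.T / temperature             # (B, B) similarities
    return -np.mean(np.diag(log_softmax(logits, axis=1)))

# Toy batch with random stand-ins for encoder features.
rng = np.random.default_rng(0)
B, N_img, N_aud, D = 4, 49, 16, 32
init_slots = rng.normal(size=(2, D))

img_targets, aud_targets = [], []
for _ in range(B):
    img_feats = rng.normal(size=(N_img, D))
    aud_feats = rng.normal(size=(N_aud, D))
    img_slots, img_attn = two_slot_attention(img_feats, init_slots)
    aud_slots, aud_attn = two_slot_attention(aud_feats, init_slots)
    # Only the target slot (index 0, by assumption) enters the loss.
    img_targets.append(img_slots[0])
    aud_targets.append(aud_slots[0])

loss = info_nce(np.stack(img_targets), np.stack(aud_targets))
print("contrastive loss on target slots:", loss)
```

Because the softmax runs over the slot axis, each local feature splits its attention between the two slots, which is what lets one slot absorb background and noise while the other represents the sounding object. In this sketch, the target slot's attention over image features (`img_attn[0]`) plays the role of a coarse localization map.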

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
