CV · Nov 6, 2022

Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

arXiv:2211.03019v1 · 16 citations · h-index: 50 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses audio-visual sound source localization, with applications in multimedia analysis. The advance is incremental: it extends existing attention-based methods by incorporating optical flow.

The paper localizes sound sources in videos without explicit annotations, using optical flow as a prior to improve the correlation between the audio and visual modalities. It reports state-of-the-art performance on the standard Soundnet Flickr and VGG Sound Source datasets.

Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps to capture the correlation between the two modalities to localize the source of the sound. In a video, oftentimes, the objects exhibiting movement are the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the Soundnet Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.
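As a rough illustration of the idea in the abstract, the sketch below combines an audio-visual attention map with a motion prior derived from optical flow. It is not the authors' implementation (see the linked repository for that): the function names, the cosine-similarity attention, the flow-magnitude prior, and the blending weight `alpha` are all assumptions chosen for illustration, and the inputs are assumed to be precomputed feature and flow arrays.

```python
import numpy as np


def audio_visual_similarity(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between one audio embedding and every spatial location.

    visual_feats: (H, W, C) visual feature map (assumed precomputed).
    audio_emb:    (C,) audio embedding (assumed precomputed).
    Returns an (H, W) map with values in [-1, 1].
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return v @ a


def flow_prior(flow, eps=1e-8):
    """Normalize per-pixel optical flow magnitude to [0, 1].

    flow: (H, W, 2) array of (dx, dy) displacements (assumed precomputed).
    """
    mag = np.linalg.norm(flow, axis=-1)
    return mag / (mag.max() + eps)


def localize(visual_feats, audio_emb, flow, alpha=0.5):
    """Blend the attention map with the motion prior (hypothetical scheme).

    alpha = 0 ignores motion entirely; alpha = 1 suppresses static regions.
    """
    sim01 = (audio_visual_similarity(visual_feats, audio_emb) + 1.0) / 2.0
    return sim01 * ((1.0 - alpha) + alpha * flow_prior(flow))


# Toy example: two locations match the audio equally, but only one is moving.
feats = np.tile(np.array([0.0, 1.0, 0.0]), (4, 4, 1))
audio = np.array([1.0, 0.0, 0.0])
feats[0, 0] = audio
feats[1, 2] = audio
flow = np.zeros((4, 4, 2))
flow[1, 2] = [3.0, 4.0]

heat = localize(feats, audio, flow)
peak = np.unravel_index(heat.argmax(), heat.shape)
```

In this toy case the attention map alone cannot separate the two matching locations; the motion prior breaks the tie in favor of the moving one, which is the intuition the abstract describes.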

Code Implementations: 1 repo
