CV | SD | AS | Jun 11, 2020

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

arXiv:2006.06175v2 | 88 citations
AI Analysis

This work addresses spatial audio-visual learning for video analysis. Its approach improves model performance on tasks such as localization, though the contribution is incremental, building on existing self-supervised methods.

The paper learns spatial correspondence between sight and sound in videos through a self-supervised task that matches spatial information in the audio to the positions of sound sources in the frame. The resulting model shows quantitative gains over baselines on three audio-visual tasks.

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.
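The channel-flip pretext task described in the abstract can be illustrated with a minimal sketch of how one training example might be built. The function name `make_flip_example`, the `p_flip` parameter, and the list-based audio representation are assumptions made for illustration only; the abstract does not specify the flip probability, the audio features, or the model architecture.

```python
import random


def make_flip_example(left, right, rng, p_flip=0.5):
    """Build one (audio, label) pair for a hypothetical flip-detection task.

    left, right: sequences of audio samples for the two stereo channels.
    rng: a random.Random instance, passed in so results are reproducible.
    p_flip: probability of swapping the channels (assumed value, not from the paper).
    Returns ((left, right), label), where label is 1 if the channels were swapped.
    """
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    flipped = rng.random() < p_flip
    if flipped:
        left, right = right, left
    # Copy so the caller's buffers are never aliased by the training example.
    return (list(left), list(right)), int(flipped)


# Toy clip: the sound is present only in the left channel.
clip_left = [1.0, 1.0, 1.0, 1.0]
clip_right = [0.0, 0.0, 0.0, 0.0]
audio, label = make_flip_example(clip_left, clip_right, random.Random(0))
```

In this sketch the video frames would be left unchanged and paired with the returned audio. A classifier trained to predict `label` from both streams can only succeed by relating where a sound source appears in the frame to which channel is louder, which is the spatial reasoning the abstract refers to.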

