CV · Mar 23, 2018

Audio-Visual Event Localization in Unconstrained Videos

arXiv:1803.08842v1 · 595 citations
Originality: Highly original
AI Analysis

It addresses audio-visual event localization for video analysis, presenting incremental advancements with new tasks and a dataset.

The paper tackles the problem of localizing events that are both visible and audible in unconstrained videos, introducing a new dataset and methods that achieve improved performance through joint audio-visual modeling and fusion.

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
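
To make the two mechanisms named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give the formulas, so the dot-product scoring in `audio_guided_attention` and the averaged fusion term in `residual_fusion` are assumptions chosen for illustration, and both function names are hypothetical. The sketch also assumes the audio and visual features for a segment already live in a shared vector space of equal dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_vec, region_vecs):
    """Pool per-region visual features with weights driven by the audio feature.

    Assumption: relevance is scored with a plain dot product between the
    audio vector and each region vector (a stand-in for a learned scorer).
    """
    scores = [sum(a * r for a, r in zip(audio_vec, region)) for region in region_vecs]
    weights = softmax(scores)
    dim = len(region_vecs[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_vecs))
        for d in range(dim)
    ]
    return weights, attended


def residual_fusion(h_audio, h_visual):
    """Update each modality with a shared fusion term added as a residual.

    Assumption: the fusion term is the element-wise average of the two
    hidden states, and tanh is the nonlinearity.
    """
    fused = [(a + v) / 2.0 for a, v in zip(h_audio, h_visual)]
    new_audio = [math.tanh(a + f) for a, f in zip(h_audio, fused)]
    new_visual = [math.tanh(v + f) for v, f in zip(h_visual, fused)]
    return new_audio, new_visual


if __name__ == "__main__":
    # Toy segment: one audio feature and two visual regions.
    audio = [1.0, 0.0]
    regions = [[2.0, 0.0], [0.0, 2.0]]
    weights, attended = audio_guided_attention(audio, regions)
    print("attention weights:", weights)
    print("fused states:", residual_fusion(audio, attended))
```

In this toy setup the region whose feature aligns with the audio vector receives the larger weight, which is the intuition behind attention "capturing semantics of sounding objects"; a real model would learn the scoring and fusion parameters from data.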

Code Implementations: 2 repos