CV | LG | SD | AS | Jul 12, 2023

Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization

arXiv:2307.06385v2
h-index: 3
AI Analysis

This work addresses the problem of localizing events in videos without precise temporal annotations, which is relevant to multimedia analysis. Its contribution is incremental, since it builds on existing weakly-supervised techniques.

The paper tackles weakly-supervised audio-visual event localization by proposing a temporal label-refinement method that uses synthetic videos and an auxiliary objective to estimate finer-grained labels, achieving improved performance over existing methods without architectural changes.

Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level and to re-train the model with these labels. Specifically, we determine the subset of labels for each slice of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and also improves performance on a related weakly-supervised task.
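
The slice-wise label estimation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each video is a single array of per-segment features (the paper works with both audio and visual streams), that the base model is a callable returning video-level class probabilities, and that a slice's labels are the predicted classes that also appear in the original video's label set. The names `make_synthetic`, `refine_labels`, `slice_len`, and `threshold` are hypothetical.

```python
import numpy as np


def make_synthetic(features_a, features_b, start, end):
    """Keep segments [start, end) of video A; take every other segment from video B."""
    if features_a.shape != features_b.shape:
        raise ValueError("both videos must have the same feature shape")
    synthetic = features_b.copy()
    synthetic[start:end] = features_a[start:end]
    return synthetic


def refine_labels(features_a, labels_a, features_b, labels_b,
                  base_model, slice_len, threshold=0.5):
    """Estimate per-segment labels for video A (hypothetical helper).

    labels_a, labels_b: multi-hot video-level label vectors.
    base_model: callable mapping a feature array to video-level class
        probabilities of shape (num_classes,). This interface is an assumption.
    """
    labels_a = np.asarray(labels_a, dtype=bool)
    labels_b = np.asarray(labels_b, dtype=bool)
    # The second video must share no video-level labels with the first.
    if np.any(labels_a & labels_b):
        raise ValueError("the two videos must have disjoint label sets")

    num_segments = features_a.shape[0]
    refined = np.zeros((num_segments, labels_a.shape[0]), dtype=bool)
    for start in range(0, num_segments, slice_len):
        end = min(start + slice_len, num_segments)
        synthetic = make_synthetic(features_a, features_b, start, end)
        probs = np.asarray(base_model(synthetic))
        # Because the label sets are disjoint, any class of video A predicted
        # on the synthetic video is attributed to the retained slice.
        refined[start:end] = (probs >= threshold) & labels_a
    return refined
```

The refined per-segment labels would then serve as supervision when re-training the model. The auxiliary objective that the paper uses to make predictions on such out-of-distribution synthetic videos more reliable is not modelled here.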

