CV · Aug 31, 2019

WSLLN: Weakly Supervised Natural Language Localization Networks

arXiv:1909.00239v11016 citations
Originality: Incremental advance
AI Analysis

WSLLN reduces annotation costs for video-language localization tasks, but the advance is incremental because it builds on existing weakly supervised methods.

The paper tackles the problem of detecting events in long, untrimmed videos using language queries without needing temporal annotations, and achieves state-of-the-art performance on the ActivityNet Captions and DiDeMo datasets.

We propose weakly supervised language localization networks (WSLLN) to detect events in long, untrimmed videos given language queries. To learn the correspondence between visual segments and texts, most previous methods require temporal coordinates (start and end times) of events for training, which leads to high annotation costs. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments on ActivityNet Captions and DiDeMo demonstrate that WSLLN achieves state-of-the-art performance.
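
The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product similarity, the sigmoid consistency score, the softmax selection score, the element-wise product used for merging, and the binary cross-entropy loss are all assumptions chosen for illustration. A real model would also use separately learned parameters for each branch, whereas both branches here share one raw similarity for brevity.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_sentence_score(segment_feats, text_feat):
    """Hypothetical two-branch scoring for one video-sentence pair.

    Returns the video-level matching score and the merged per-segment scores.
    """
    # Assumption: raw similarity is a plain dot product of segment and text features.
    raw = [dot(seg, text_feat) for seg in segment_feats]
    # Consistency branch (assumed form): each segment is scored independently.
    consistency = [sigmoid(r) for r in raw]
    # Selection branch (assumed form): segments compete with each other given the text.
    selection = softmax(raw)
    # Merge (assumed form): element-wise product, summed into one video-level score.
    merged = [c * s for c, s in zip(consistency, selection)]
    return sum(merged), merged


def matching_loss(video_score, is_matching_pair):
    """Binary cross-entropy on the video-level score (assumed loss).

    Only the video-sentence pairing is supervised; no start or end times are used.
    """
    eps = 1e-7
    p = min(max(video_score, eps), 1.0 - eps)
    return -math.log(p) if is_matching_pair else -math.log(1.0 - p)


# Toy example: three candidate segments and one sentence embedding (made-up values).
segments = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sentence = [2.0, 0.0]
score, merged = video_sentence_score(segments, sentence)
best_segment = max(range(len(merged)), key=lambda i: merged[i])
```

Under these assumptions, training only needs to know whether a sentence belongs to a video, and localization at inference time amounts to returning the segment with the highest merged score (`best_segment` above).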
