CV | Mar 18, 2021

Decoupled Spatial Temporal Graphs for Generic Visual Grounding

arXiv:2103.10191v1 | 5 citations
Originality: Incremental advance
AI Analysis

This work addresses a more practical and challenging problem, grounding objects in untrimmed videos for real-world applications, but it is an incremental advance that builds on existing visual grounding methods.

The paper tackles generic visual grounding in untrimmed videos by proposing DSTG, which decouples spatial and temporal representations and uses contrastive learning to improve discriminativeness and temporal consistency. It reports state-of-the-art results on Charades-STA, ActivityNet-Caption, and a new dataset, GVG.

Visual grounding is a long-standing problem in vision-language understanding due to its diversity and complexity. Current practice concentrates mostly on visual grounding in still images or well-trimmed video clips. This work instead investigates a more general setting, generic visual grounding, which aims to find all the objects satisfying the given expression and is more challenging yet more practical in real-world scenarios. Importantly, grounding results are expected to localize targets accurately in both space and time. However, it is difficult to balance appearance and motion features, and in real scenarios models tend to fail to distinguish the target from distractors with similar attributes. Motivated by these considerations, we propose a simple yet effective approach, named DSTG, which 1) decomposes the spatial and temporal representations to collect comprehensive cues for precise grounding, and 2) enhances discriminativeness against distractors and temporal consistency with a contrastive learning routing strategy. We further introduce a new video dataset, GVG, that consists of challenging referring cases across a wide range of videos. Experiments demonstrate the superiority of DSTG over the state of the art on the Charades-STA, ActivityNet-Caption, and GVG datasets. Code and dataset will be made available.
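
The abstract does not spell out the contrastive learning routing strategy, so the following is only a generic illustration of how a contrastive objective can separate a target from similar-looking distractors. It is a minimal sketch of the standard InfoNCE loss, not DSTG's actual formulation; the function names, the toy 2-D vectors, and the temperature value are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's method).

    query:     embedding of the referring expression (hypothetical)
    positive:  embedding of the object that matches the expression
    negatives: embeddings of distractor objects
    Returns -log softmax probability assigned to the positive.
    """
    q = normalize(query)
    # Cosine similarities scaled by temperature; the positive is index 0.
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    query = [1.0, 0.0]
    target = [0.9, 0.1]          # close to the query
    distractor = [0.0, 1.0]      # far from the query
    print(info_nce(query, target, [distractor]))   # small loss
    print(info_nce(query, distractor, [target]))   # large loss
```

The loss is low when the query is closer to the matching object than to any distractor, and it grows when a distractor is the better match, which is the general behaviour a discriminative grounding objective relies on.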

