CV | Apr 2, 2024

SnAG: Scalable and Accurate Video Grounding

arXiv:2404.02257v2 | 40 citations | h-index: 10 | CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the scalability bottleneck in video grounding, a core task in vision-language learning, and reports a substantial improvement over existing methods on long-form video.

The paper tackles the problem of scaling video grounding to long videos with many text queries by analyzing cross-modal fusion. The resulting method, SnAG, is reported to be 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the MAD dataset.

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
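The cost argument for late fusion can be illustrated with a toy sketch. This is not SnAG's architecture: `encode_video`, the additive early-fusion step, and the dot-product scoring are all hypothetical stand-ins, chosen only to show that early fusion re-runs the video backbone once per query while late fusion runs it once per video and reuses the features.

```python
import numpy as np

rng = np.random.default_rng(0)


def encode_video(video, counter):
    """Hypothetical stand-in for an expensive video backbone."""
    counter["video_encoder_calls"] += 1
    return np.tanh(video)  # shape (T, D)


def early_fusion_scores(video, queries, counter):
    """Mix each query into the input, so the backbone runs once per query."""
    scores = []
    for q in queries:
        fused = encode_video(video + q, counter)  # q broadcasts over T frames
        scores.append(fused.sum(axis=1))          # toy per-frame score
    return np.stack(scores)                       # shape (Q, T)


def late_fusion_scores(video, queries, counter):
    """Encode the video once, then score every query against shared features."""
    feats = encode_video(video, counter)          # shape (T, D), computed once
    return queries @ feats.T                      # shape (Q, T)


# Assumed toy sizes: T frames, D feature dims, Q text queries.
T, D, Q = 1000, 16, 50
video = rng.normal(size=(T, D))
queries = rng.normal(size=(Q, D))

early_counter = {"video_encoder_calls": 0}
late_counter = {"video_encoder_calls": 0}

early = early_fusion_scores(video, queries, early_counter)
late = late_fusion_scores(video, queries, late_counter)

print("early fusion backbone calls:", early_counter["video_encoder_calls"])
print("late fusion backbone calls:", late_counter["video_encoder_calls"])
```

In this toy setting the backbone cost of early fusion grows with the number of queries, while late fusion pays it once per video. That is the kind of saving the abstract attributes to late fusion for long videos with many queries.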

Code Implementations: 1 repo
