CV | Dec 31, 2023

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

arXiv:2401.00901v2 | 36 citations | h-index: 28 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

The paper addresses a critical limitation in video grounding, namely the handling of diverse linguistic and visual concepts, though the advance is incremental because it builds on existing foundational models.

The paper tackles open-vocabulary spatio-temporal video grounding by introducing a model that leverages pre-trained representations from foundational spatial grounding models to bridge the semantic gap between natural language and visual content. It achieves state-of-the-art results in closed-set evaluations on VidSTG and HC-STVG, and in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions it outperforms recent methods by 4.88 m_vIoU and 1.83% accuracy.

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released.
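The abstract reports gains in m_vIoU without defining it. As a rough illustration, the sketch below follows the definition commonly used in spatio-temporal video grounding benchmarks: per-frame box IoU is summed over the frames where the predicted and ground-truth tubes overlap in time, divided by the number of frames in their temporal union, and then averaged over samples. This is an assumption about the metric, not code from the paper; the names `box_iou`, `v_iou`, `mean_v_iou` and the dict-based tube representation are hypothetical choices made for this example.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention
Tube = Dict[int, Box]                    # frame index -> box on that frame


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Assumed vIoU: sum of per-frame IoU over the temporal intersection,
    normalised by the size of the temporal union."""
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def mean_v_iou(pairs: Iterable[Tuple[Tube, Tube]]) -> float:
    """Assumed m_vIoU: average vIoU over (prediction, ground truth) pairs."""
    scores = [v_iou(p, g) for p, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    gt_tube = {t: box for t in range(0, 4)}    # frames 0..3
    pred_tube = {t: box for t in range(2, 6)}  # frames 2..5
    print(v_iou(pred_tube, gt_tube))
    print(mean_v_iou([(gt_tube, gt_tube), (pred_tube, gt_tube)]))
```

Under this definition a prediction with perfect boxes is still penalised for poor temporal localisation, because frames outside the temporal intersection contribute nothing to the numerator while still counting in the denominator.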

