CV | Dec 31, 2023

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

arXiv:2401.00901v2 | 20.236 citations | h-index: 28 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

The paper addresses a critical limitation in video grounding, namely the handling of diverse linguistic and visual concepts, but the advance is incremental because it builds on existing foundational models.

The paper tackles open-vocabulary spatio-temporal video grounding by introducing a model that leverages pre-trained representations to bridge the semantic gap between language and visual content. It achieves state-of-the-art results in closed-set evaluations and, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, outperforms recent methods by 4.88 m_vIoU and 1.83% accuracy.

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released.
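
The m_vIoU figure quoted above is not defined on this page. In the spatio-temporal grounding literature it is commonly the mean, over test samples, of a per-video vIoU: the sum of box IoUs on frames where prediction and ground truth overlap in time, divided by the number of frames in the union of the two temporal segments. The sketch below assumes that common definition; the function names and the frame-indexed dictionary layout are illustrative and are not taken from the paper's code.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Per-video vIoU; keys are frame indices, values are boxes.

    Box IoU is summed over frames present in both tubes and
    normalised by the number of frames present in either tube.
    """
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def mean_v_iou(samples: Iterable[Tuple[Dict[int, Box], Dict[int, Box]]]) -> float:
    """m_vIoU: average of per-video vIoU over (prediction, ground truth) pairs."""
    scores = [v_iou(p, g) for p, g in samples]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a prediction is penalised both for loose boxes and for a temporal segment that is too long or too short, since extra or missing frames enlarge the denominator without adding to the numerator.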
