CV | Jul 21, 2022

LocVTP: Video-Text Pre-training for Temporal Localization

arXiv:2207.10362v1 | 73 citations | h-index: 26 | Has Code
Originality: Highly original
AI Analysis

This paper addresses the limited transferability of video-text pre-training to temporal localization tasks, a problem that matters to researchers and practitioners in video understanding.

The paper tackles the incompatibility of existing Video-Text Pre-training (VTP) methods with localization tasks by proposing LocVTP, which the authors report achieves state-of-the-art performance on both retrieval-based and localization-based tasks, evaluated on four downstream tasks across six datasets.

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, is under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one through a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned features, we propose a context projection head and a temporal-aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs and training strategies.
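
The abstract names fine-grained contrastive alignment between clips and words but does not state the loss itself. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss over clip and word features. It assumes the clip-word pairs are already matched (pair i is the positive) and that no feature vector is all zeros. It does not implement the paper's correspondence discovery scheme, context projection head, or temporal-aware loss. The function names `l2_normalize` and `info_nce` and the default temperature are hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(clip_feats, word_feats, temperature=0.07):
    """Symmetric InfoNCE over N matched (clip, word) pairs.

    Assumption for this sketch: clip_feats[i] and word_feats[i] form the
    positive pair; every other combination is treated as a negative.
    """
    clips = [l2_normalize(c) for c in clip_feats]
    words = [l2_normalize(w) for w in word_feats]
    n = len(clips)

    # sim[i][j] = cosine similarity of clip i and word j, scaled by temperature
    sim = [
        [sum(a * b for a, b in zip(c, w)) / temperature for w in words]
        for c in clips
    ]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_denom = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += log_denom - row[i]
        return total / n

    # Average the clip-to-word and word-to-clip directions
    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(columns))


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy input the loss is lower when each clip is paired with its matching word than when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.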

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
