CV · CL · Sep 28, 2021

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

arXiv:2109.14084v2 · 781 citations · Has Code
Originality Highly original
AI Analysis

The paper addresses the challenge of understanding video content without labeled data for downstream tasks. The approach is incremental in that it builds on existing contrastive learning techniques.

The paper tackles the problem of zero-shot video-text understanding by introducing VideoCLIP, a contrastive pre-training method that achieves state-of-the-art performance on tasks like text-video retrieval and VideoQA, sometimes outperforming supervised approaches.

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
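
The contrastive objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style loss (video-to-text plus text-to-video) over a batch in which the i-th video clip and the i-th text span form the temporally overlapping positive pair, and every other item in the batch serves as a negative. In the paper the negatives are hard negatives obtained through nearest neighbor retrieval; here the batch is simply taken as given. The function names, the `temperature` parameter, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax_diag(scores):
    """For each row i, return log(exp(s_ii) / sum_j exp(s_ij))."""
    out = []
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row max for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in row))
        out.append(row[i] - log_denom)
    return out


def symmetric_info_nce(video_embs, text_embs, temperature=1.0):
    """Illustrative symmetric contrastive loss (an assumption, not the paper's code).

    video_embs[i] and text_embs[i] are treated as a positive pair;
    all other pairings in the batch act as negatives.
    """
    n = len(video_embs)
    # sim[i][j] = temperature-scaled similarity of video i and text j
    sim = [[dot(v, t) / temperature for t in text_embs] for v in video_embs]
    # Transpose so that rows index text and columns index video
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    video_to_text = -sum(log_softmax_diag(sim)) / n
    text_to_video = -sum(log_softmax_diag(sim_t)) / n
    return video_to_text + text_to_video


# Toy batch: two clips and two captions in a 2-d embedding space.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each caption matches its clip
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # captions assigned to the wrong clip

loss_aligned = symmetric_info_nce(videos, aligned_texts)
loss_swapped = symmetric_info_nce(videos, swapped_texts)
```

In this toy batch the aligned pairing yields a lower loss than the swapped one, which is the behavior contrastive pre-training relies on: matching video-text pairs are pulled together in the embedding space while mismatched pairs are pushed apart.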

Code Implementations: 2 repos
