VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
VideoCLIP addresses the challenge of understanding video content without labeled data for downstream tasks. The contribution is incremental in that it builds on existing contrastive learning techniques.
The paper tackles the problem of zero-shot video-text understanding by introducing VideoCLIP, a contrastive pre-training method that achieves state-of-the-art performance on tasks like text-video retrieval and VideoQA, sometimes outperforming supervised approaches.
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
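The contrastive objective described above can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so this assumes a symmetric InfoNCE-style objective, a common choice for contrastive pre-training. The function names (`symmetric_info_nce`, `log_softmax_at`) and the `temperature` parameter are hypothetical and are not taken from the MMPT code. Row i of each input is treated as a positive video-text pair, and every other row in the batch serves as a negative. In a setup like the one described, those other rows would be hard negatives gathered by nearest neighbor retrieval; the retrieval step itself is not shown.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_at(scores, index):
    """Numerically stable log-softmax of `scores`, evaluated at `index`."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z


def symmetric_info_nce(video_embs, text_embs, temperature=1.0):
    """Sum of video-to-text and text-to-video InfoNCE losses.

    Assumption: row i of video_embs and row i of text_embs form a positive
    pair; all other rows in the batch act as negatives.
    """
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    n = len(videos)

    # Cosine similarity matrix, scaled by the (assumed) temperature.
    sim = [[dot(v, t) / temperature for t in texts] for v in videos]

    # Video-to-text: each video should pick out its own text among all texts.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n

    # Text-to-video: each text should pick out its own video among all videos.
    t2v = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n

    return v2t + t2v


if __name__ == "__main__":
    # Toy embeddings: two perfectly aligned pairs versus the same pairs swapped.
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", symmetric_info_nce(videos, aligned_texts, temperature=0.1))
    print("swapped:", symmetric_info_nce(videos, swapped_texts, temperature=0.1))
```

On these toy inputs the aligned batch yields a much lower loss than the swapped batch, which is the behavior a contrastive objective is meant to reward. Harder negatives make the rows of the similarity matrix less separable, so the model has to learn finer-grained video-text associations to reduce the loss.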