CV · Feb 24, 2021

A Straightforward Framework For Video Retrieval Using CLIP

arXiv:2102.12443v2 · 140 citations
AI Analysis

This work addresses text-to-video retrieval without relying on user annotations. Its contribution is incremental, since it extends an existing model (CLIP) to a new domain.

The authors tackled video retrieval by applying the CLIP language-image model to obtain video representations without user annotations, achieving state-of-the-art results on MSR-VTT and MSVD benchmarks.

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
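The idea of comparing text and video in a common embedding space can be illustrated with a minimal sketch. It assumes that each video is represented by the mean of its per-frame embeddings and that candidates are ranked by cosine similarity to the query embedding. The abstract only mentions "various techniques", so mean pooling is one simple aggregation choice here, not necessarily the authors' exact method. The frame and query vectors are synthetic stand-ins for CLIP outputs, and all function names (`video_embedding`, `rank_videos`, `noisy`) are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def video_embedding(frame_embeddings):
    """Mean-pool unit-length frame embeddings into one unit-length video vector.

    Assumption: mean pooling is used purely for illustration.
    """
    frames = [l2_normalize(f) for f in frame_embeddings]
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar, plus raw scores."""
    scores = [cosine(text_embedding, v) for v in video_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy data: three "videos", each with five noisy frames around its own axis.
# In practice these vectors would come from an image encoder and a text encoder.
random.seed(0)
DIM = 8


def noisy(center, scale=0.05):
    """Add small Gaussian noise to a vector (synthetic stand-in for real features)."""
    return [c + random.gauss(0.0, scale) for c in center]


centers = [[1.0 if i == k else 0.0 for i in range(DIM)] for k in range(3)]
videos = [video_embedding([noisy(c) for _ in range(5)]) for c in centers]

# A "text query" embedding that lies close to the second video.
query = noisy(centers[1])
order, scores = rank_videos(query, videos)
print("ranking:", order)
```

Because every vector is normalized before comparison, the ranking depends only on direction in the shared space, which is what makes image-derived and text-derived embeddings directly comparable.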

Code Implementations: 1 repo
