CV · Feb 24, 2021

A Straightforward Framework For Video Retrieval Using CLIP

Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín

arXiv:2102.12443v224.2140 citations · Has Code

Originality: Incremental advance

AI Analysis

This work targets retrieval settings where a text query must be matched to a video, or a video to text, without relying on user annotations. The contribution is incremental: it extends an existing language-image model, CLIP, to the video domain.

The authors tackled video retrieval by applying the CLIP language-image model to obtain video representations without user annotations, achieving state-of-the-art results on MSR-VTT and MSVD benchmarks.

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
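The abstract does not say how image embeddings are turned into a video representation. As a minimal sketch of one plausible approach, the snippet below mean-pools per-frame embeddings into a single video vector and ranks videos by cosine similarity to a text embedding. Mean pooling, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from CLIP's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average unit-length frame embeddings into one unit-length video embedding.

    Assumption: mean pooling is one simple aggregation choice, not necessarily
    the technique used in the paper.
    """
    if not frame_embeddings:
        raise ValueError("a video needs at least one frame embedding")
    normalized = [l2_normalize(frame) for frame in frame_embeddings]
    dim = len(normalized[0])
    mean = [sum(frame[i] for frame in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def rank_videos(text_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the text query."""
    query = l2_normalize(text_embedding)
    scores = {
        video_id: sum(q * v for q, v in zip(query, embedding))
        for video_id, embedding in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical stand-ins for encoder outputs (real CLIP vectors are much larger).
videos = {
    "cooking": mean_pool([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "soccer": mean_pool([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}
ranking = rank_videos([0.8, 0.2, 0.0], videos)
print(ranking)
```

Because text and frames share one embedding space, the same ranking function works in the other direction: a video embedding can serve as the query against a set of caption embeddings.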
