CV

May 17, 2022

A CLIP-Hitchhiker's Guide to Long Video Retrieval

Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

arXiv:2205.08508v1

25.176 citations

h-index: 188

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of effective video retrieval for researchers and practitioners. It is an incremental advance because it builds on existing CLIP-based approaches.

The paper tackles the problem of adapting image-text models like CLIP for long video retrieval by improving temporal aggregation, finding that a weighted-mean baseline via query-scoring outperforms prior methods and mean-pooling, achieving state-of-the-art results on benchmarks.

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation that outperforms mean-pooling the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and over mean-pooling. In doing so, we provide an improved baseline for others to compare to and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
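To make the contrast between mean-pooling and query-scoring concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes the frame weights are a temperature-scaled softmax over query-frame cosine similarities, it does not renormalize the pooled video embedding, and the function names (`query_scored_similarity`, `mean_pooled_similarity`) and the `temperature` parameter are hypothetical. Real embeddings would come from CLIP's image and text encoders; toy 2-D vectors stand in for them here.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pooled_similarity(query, frames):
    """Baseline: average the frame embeddings, then score against the query."""
    q = normalize(query)
    fs = [normalize(f) for f in frames]
    video = [sum(f[d] for f in fs) / len(fs) for d in range(len(q))]
    return dot(q, video)


def query_scored_similarity(query, frames, temperature=0.1):
    """Weighted mean of frame embeddings, weighted by similarity to the query.

    Assumption: weights are softmax(similarity / temperature). A very large
    temperature flattens the weights and approaches mean-pooling.
    """
    q = normalize(query)
    fs = [normalize(f) for f in frames]
    scores = [dot(q, f) for f in fs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    video = [sum(w * f[d] for w, f in zip(weights, fs)) for d in range(len(q))]
    return dot(q, video), weights


# Toy example: only the first of three frames matches the query.
query = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

mean_sim = mean_pooled_similarity(query, frames)
sharp_sim, sharp_weights = query_scored_similarity(query, frames, temperature=0.1)
flat_sim, flat_weights = query_scored_similarity(query, frames, temperature=1e6)
```

In this toy case the single relevant frame is diluted by the two irrelevant ones under mean-pooling, while query-scoring concentrates the weight on it. This matters most for long videos, where only a small fraction of frames may relate to the text query.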

