CV, IR | Mar 24, 2025

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, Rama Chellappa

arXiv:2503.19009v1 | 18 citations | h-index: 28 | CVPR

Originality: Incremental advance

AI Analysis

The paper addresses text-to-video retrieval for applications such as video search, but the advance is incremental because it builds on existing late interaction techniques.

The paper tackles text-to-video retrieval by proposing Video-ColBERT, a method using fine-grained token-wise interaction, query and visual expansions, and a dual sigmoid loss, which improves performance on benchmarks compared to other bi-encoder methods.

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
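
To make the "late interaction" idea in the abstract concrete, here is a minimal sketch of ColBERT-style MaxSim scoring together with a pairwise sigmoid loss. It is an illustration under stated assumptions, not the authors' implementation: the function names (`maxsim_score`, `sigmoid_pair_loss`), the `temperature` and `bias` parameters, and the use of cosine similarity are all hypothetical choices, and the abstract does not specify how the spatial and temporal scores or the two sigmoid loss terms are combined.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, video_tokens):
    """ColBERT-style late interaction (MaxSim).

    For each query token, take the highest cosine similarity against any
    video token, then sum those maxima over the query tokens.
    """
    q = [l2_normalize(t) for t in query_tokens]
    v = [l2_normalize(t) for t in video_tokens]
    return sum(max(dot(qt, vt) for vt in v) for qt in q)


def sigmoid_pair_loss(score, label, temperature=1.0, bias=0.0):
    """Pairwise sigmoid loss for one (query, video) pair.

    label is +1 for a matching pair and -1 for a non-matching pair.
    temperature and bias are assumed (hypothetical) scalars.
    Computes softplus(-z) in a numerically stable way.
    """
    z = label * (temperature * score + bias)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy example: 2 query tokens, 3 video tokens, embedding dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
score = maxsim_score(query, video)
loss_if_match = sigmoid_pair_loss(score, label=+1)
loss_if_mismatch = sigmoid_pair_loss(score, label=-1)
```

Because each query token is matched against video tokens independently, video token embeddings can be computed offline and only the cheap max-and-sum step runs at query time, which is what keeps this kind of bi-encoder scoring efficient.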
