CV · IR · Mar 24, 2025

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

arXiv:2503.19009v1 · 18 citations · h-index: 28 · CVPR
Originality: Incremental advance
AI Analysis

The paper addresses text-to-video retrieval for applications such as video search, but the advance is incremental because it builds on existing late interaction techniques.

The paper tackles text-to-video retrieval by proposing Video-ColBERT, a method that combines fine-grained token-wise interaction, query and visual expansions, and a dual sigmoid loss. The authors report improved performance on common benchmarks compared with other bi-encoder methods.

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
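
The abstract names token-wise late interaction and a sigmoid loss but does not spell out either one. As a rough illustration only, the sketch below shows ColBERT-style MaxSim scoring and a SigLIP-style pairwise sigmoid loss in plain Python. Everything in it is an assumption rather than the authors' implementation: the function names, the choice to sum a spatial and a temporal MaxSim score, and the `temperature` and `bias` parameters are hypothetical, and the "dual" aspect of the paper's loss is not modeled here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def maxsim(query_tokens, video_tokens):
    """ColBERT-style late interaction.

    For each query token, take the maximum cosine similarity over all
    video tokens, then sum those maxima over the query tokens.
    """
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(t) for t in video_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


def video_score(query_tokens, spatial_tokens, temporal_tokens):
    """Assumption: combine a spatial and a temporal MaxSim score by summing."""
    return maxsim(query_tokens, spatial_tokens) + maxsim(query_tokens, temporal_tokens)


def sigmoid_pairwise_loss(scores, temperature=1.0, bias=0.0):
    """SigLIP-style loss over a square query-by-video score matrix.

    Diagonal entries are treated as matching pairs (label +1) and all
    other entries as non-matching pairs (label -1). The loss is the sum
    of -log(sigmoid(label * (temperature * score + bias))) divided by
    the batch size.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (temperature * scores[i][j] + bias)
            # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably.
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n
```

Because each query token is matched against every video token independently, the video token embeddings can be computed and stored offline, which is what keeps this family of methods in the bi-encoder category.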
