Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
The paper addresses text-to-video retrieval for applications such as video search, but its contribution is incremental, since it builds on existing late interaction techniques.
The paper tackles text-to-video retrieval by proposing Video-ColBERT, a method that combines fine-grained token-wise interaction, query and visual expansions, and a dual sigmoid loss. It reports improved performance on common benchmarks compared to other bi-encoder methods.
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss used during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
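
To make the token-wise interaction and the loss more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the interaction is the ColBERT-style MaxSim operator (each query token is matched to its most similar video token, and the maxima are summed), and that "dual" means a pairwise sigmoid loss is applied separately to a spatial and a temporal score matrix and the two are added. The function names, the `temperature` and `bias` values, and the way the two losses are combined are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def maxsim(query_tokens, video_tokens):
    """ColBERT-style late interaction score (assumed form of the interaction).

    query_tokens: (num_query_tokens, dim) array
    video_tokens: (num_video_tokens, dim) array
    For each query token, take the best-matching video token, then sum.
    """
    q = l2_normalize(query_tokens)
    v = l2_normalize(video_tokens)
    sim = q @ v.T  # (num_query_tokens, num_video_tokens)
    return float(sim.max(axis=1).sum())


def sigmoid_loss(scores, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a (batch, batch) query-video score matrix.

    Diagonal entries are treated as matching pairs (label +1), all other
    entries as non-matching pairs (label -1). The temperature and bias
    defaults are hypothetical values, not taken from the paper.
    """
    n = scores.shape[0]
    labels = 2.0 * np.eye(n) - 1.0
    logits = labels * (temperature * scores + bias)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.logaddexp(0.0, -logits).sum() / n)


def dual_sigmoid_loss(spatial_scores, temporal_scores):
    """Assumed 'dual' form: one sigmoid loss per score matrix, summed."""
    return sigmoid_loss(spatial_scores) + sigmoid_loss(temporal_scores)
```

Under these assumptions, `maxsim` would be called once with patch-level (spatial) video tokens and once with frame-level (temporal) video tokens for every query-video pair in a batch, giving the two score matrices that `dual_sigmoid_loss` consumes. Supervising each score separately rather than only their sum is one plausible reading of how the representations could remain individually strong yet compatible.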