CV, Jan 20

PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval

arXiv:2601.13797v1, 1 citation, h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficiently retrieving videos from combined video-and-text queries, for applications such as video search. It offers incremental improvements over existing methods.

The paper tackles composed video retrieval by introducing PREGEN, an efficient framework that pairs a frozen, pre-trained vision-language model with a lightweight encoder. It reports state-of-the-art results, with Recall@1 gains of +27.23 and +69.59 on standard CoVR benchmarks.

Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
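The extraction step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, the toy nested lists stand in for a real VLM's per-layer hidden states, mean pooling across layers is an assumption (the abstract says only "pooled representations"), and the trained lightweight encoder is omitted entirely. The sketch shows how the final token's hidden state might be taken from each layer, pooled into one query embedding, and scored with Recall@1 by cosine similarity.

```python
import math

def last_token_per_layer(hidden_states):
    """hidden_states[layer][token] is a feature vector; keep the final token of each layer."""
    return [layer[-1] for layer in hidden_states]

def mean_pool(vectors):
    """Average equal-length vectors. Assumed pooling; the abstract does not specify one."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_1(query_embs, gallery_embs, targets):
    """Percentage of queries whose top-ranked gallery item is the target index."""
    hits = 0
    for query, target in zip(query_embs, targets):
        best = max(range(len(gallery_embs)),
                   key=lambda i: cosine(query, gallery_embs[i]))
        hits += best == target
    return 100.0 * hits / len(query_embs)

# Toy stand-in for VLM output: 2 layers, 3 tokens per layer, 2-dimensional states.
toy_hidden = [
    [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]],
    [[5.0, 5.0], [2.0, 2.0], [3.0, 0.0]],
]
query = mean_pool(last_token_per_layer(toy_hidden))

# Hypothetical gallery of two candidate video embeddings; index 1 is the target.
gallery = [[0.0, 1.0], [4.0, 0.1]]
score = recall_at_1([query], gallery, [1])
```

In a real pipeline the pooled vector would be passed through the trained encoder before retrieval; here it is compared to the gallery directly to keep the example short.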

