CV | Mar 25, 2024

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

arXiv:2403.16997v1 | 30 citations | h-index: 55 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of more sophisticated video search in large databases for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles composed video retrieval (CoVR) by introducing a framework that uses detailed language descriptions to encode query-specific context and learns discriminative vision-only, text-only and vision-text embeddings for better alignment. It reports state-of-the-art results on three datasets, with gains of up to around 7% in recall@K=1.
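Since the headline result is stated in recall@K=1, a minimal sketch of how that metric is commonly computed may help. The function name `recall_at_k`, the toy similarity scores and the target indices below are illustrative assumptions, not taken from the paper or its repository.

```python
def recall_at_k(similarity, target_indices, k=1):
    """Fraction of queries whose correct target is among the k top-scoring candidates.

    similarity[i][j] is the score of candidate j for query i;
    target_indices[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, target_indices):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(target_indices)


# Toy data (made up): 3 queries scored against 4 candidate videos.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct target 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct target 1 is ranked second
    [0.1, 0.2, 0.3, 0.7],  # correct target 3 is ranked first
]
targets = [0, 1, 3]

r1 = recall_at_k(similarity, targets, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(similarity, targets, k=2)  # all 3 queries hit within rank 2
```

In this toy case recall@1 is 2/3 and recall@2 is 1.0; a gain of 7 points in recall@1 would mean roughly 7 more queries out of every 100 have their correct target ranked first.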

Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and only represents the target video using visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative embeddings of vision only, text only and vision-text for better alignment to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at https://github.com/OmkarThawakar/composed-video-retrieval
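To make the idea of aligning a query with vision-only, text-only and vision-text target embeddings concrete, here is a rough sketch of retrieval-time scoring. It assumes each target video is stored with three embeddings and that the query is scored by averaging its cosine similarity to all three. The averaging rule, the names `combined_score` and `retrieve`, and the 2-D toy vectors are assumptions for illustration only; the paper's actual training objective and scoring may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_score(query, target):
    """Average similarity of the query to the three target embeddings.

    Averaging is an assumption made for this sketch, not the paper's stated rule.
    """
    keys = ("vision", "text", "vision_text")
    return sum(cosine(query, target[key]) for key in keys) / len(keys)


def retrieve(query, database):
    """Return the index of the highest-scoring target in the database."""
    return max(range(len(database)), key=lambda i: combined_score(query, database[i]))


# Toy 2-D embeddings (made up). In practice these would come from learned encoders.
query = [1.0, 0.0]
database = [
    {"vision": [1.0, 0.0], "text": [0.9, 0.1], "vision_text": [1.0, 0.1]},
    {"vision": [0.0, 1.0], "text": [0.1, 0.9], "vision_text": [0.0, 1.0]},
]

best = retrieve(query, database)  # index of the best-matching target
```

Representing each target with text-derived embeddings as well as a visual one is what lets the detailed language descriptions contribute to the match, rather than relying on the visual embedding alone.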

Code Implementations: 1 repo