CV · Aug 28, 2023

CoVR-2: Automatic Data Construction for Composed Video Retrieval

DeepMind
arXiv:2308.14746v4 · 57 citations · h-index: 151
Originality: Incremental advance
AI Analysis

This work addresses the scalability of composed retrieval by automating dataset construction, which is relevant to both researchers and practitioners. It is an incremental advance, since the model builds on existing methods such as BLIP-2.

The authors tackle the expensive manual annotation required for composed image retrieval by proposing an automatic dataset creation method that generates triplets from video-caption pairs. This extends the task to composed video retrieval and yields datasets such as WebVid-CoVR, with 1.6 million triplets. Their model improves on the state of the art in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr/.
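The mining step described in the abstract (pairing videos with a similar caption, then asking a language model for the modification text) can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the text above: "similar caption" is approximated here as two captions of equal length that differ in exactly one word, and the function names `mine_caption_pairs` and `build_modification_prompt` are hypothetical. The prompt builder only formats a string; no real language model API is called, and the wording of the prompt is illustrative.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Return sorted index pairs (i, j) whose captions differ in exactly one word.

    Assumption: caption similarity is approximated as "same word count,
    one substituted word". The source does not specify the criterion.
    """
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = caption.lower().split()
        # Mask each position in turn. Two captions that share a masked
        # form can differ only in the word at the masked position.
        for pos in range(len(words)):
            masked = tuple(words[:pos] + ["<mask>"] + words[pos + 1:])
            buckets[(pos, masked)].append((idx, words[pos]))

    pairs = set()
    for members in buckets.values():
        for (i, word_i), (j, word_j) in combinations(members, 2):
            if word_i != word_j:  # skip exact duplicate captions
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def build_modification_prompt(source_caption, target_caption):
    """Format a (hypothetical) prompt asking a language model for the edit text."""
    return (
        "Describe the change that turns the first caption into the second.\n"
        f"First: {source_caption}\n"
        f"Second: {target_caption}\n"
        "Modification:"
    )


if __name__ == "__main__":
    captions = [
        "a dog running on the beach",
        "a cat running on the beach",
        "a dog running on the grass",
        "sunset over mountains",
    ]
    for i, j in mine_caption_pairs(captions):
        print(build_modification_prompt(captions[i], captions[j]))
        print()
```

Each mined pair, together with the generated modification text, would form one (query video, modification text, target video) triplet. Bucketing by masked caption avoids comparing every caption against every other, which matters for a collection the size of WebVid2M.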

Code Implementations: 1 repo