CV | LG | May 27, 2025

ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

arXiv:2505.20764v1 | 4 citations | h-index: 12 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in composed image retrieval, the inaccurate representation of the query image and the text modification, and offers an incremental improvement over existing methods.

The paper tackles the problem of composed image retrieval (CIR) by introducing ConText-CIR, a framework that improves representation of text modifications and query images, achieving new state-of-the-art results on benchmarks like CIRR and CIRCO in both supervised and zero-shot settings.

Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
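To make the task definition above concrete, here is a minimal sketch of the retrieval step only: a query image embedding and a text-modification embedding are fused into one query vector, and gallery images are ranked by cosine similarity to it. Everything here is an illustrative assumption rather than the ConText-CIR method. The function names (`compose_query`, `rank_gallery`), the additive fusion, and the toy 3-dimensional embeddings are hypothetical; the paper's learned composition and its Text Concept-Consistency loss are not reproduced.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose_query(image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Fuse the query image and text-modification embeddings.

    ASSUMPTION: simple additive fusion stands in for a learned
    composition model; it is not the method described in the abstract.
    """
    fused = [a + b for a, b in zip(image_emb, text_emb)]
    return l2_normalize(fused)


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(g)))
        for g in gallery
    ]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy, hand-made embeddings (hypothetical values, not model outputs).
image_emb = [1.0, 0.0, 0.0]   # content of the query image
text_emb = [0.0, 1.0, 0.0]    # the requested modification
gallery = [
    [1.0, 0.0, 0.0],  # same as the query image, modification not applied
    [1.0, 1.0, 0.0],  # query image content plus the modification
    [0.0, 0.0, 1.0],  # unrelated image
]

ranking = rank_gallery(compose_query(image_emb, text_emb), gallery)
print(ranking)
```

In this toy setup the gallery item that combines the query content with the modification ranks first, which is the behaviour a CIR system is evaluated on with benchmarks such as CIRR and CIRCO.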

Code Implementations: 1 repo
