CV, MM | Apr 1, 2025

Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data

arXiv:2504.00812v2 | 3 citations | h-index: 16 | IJCNN
Originality: Incremental advance
AI Analysis

The approach reduces annotation costs for CIR and enables scalable training, though the advance is incremental because it builds on existing foundation models.

The paper tackles the high cost of human-annotated data for Composed Image Retrieval (CIR) by using large language models to generate training data from unlabeled images. The resulting model achieves state-of-the-art zero-shot performance on the CIRR and FashionIQ datasets, and its results improve as more generated data is added.

Composed Image Retrieval (CIR) is the task of retrieving images that match a reference image augmented with text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often requires human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training, even though unlabeled data is abundant. With the recent advances in foundation models, we advocate a shift in the CIR training paradigm in which human annotations are replaced by large language models (LLMs). Specifically, we demonstrate that large captioning and language models can efficiently generate data for CIR while relying only on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on the CIRR and FashionIQ datasets. Furthermore, we demonstrate that as the amount of generated data increases, our zero-shot model gets closer to the performance of supervised baselines.
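To make the task setup concrete, here is a minimal Python sketch of a CIR triplet and of retrieval over a gallery of precomputed embeddings. Everything in it is an illustrative assumption: the names `CIRTriplet`, `compose`, and `retrieve` are hypothetical, the three-dimensional vectors are toy values, and the additive `compose` function is a simple stand-in, not the embedding reformulation architecture of InstructCIR, which the abstract does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class CIRTriplet:
    """One training example: reference image, modification text, target image."""
    reference_image: str    # identifier of the reference image (hypothetical)
    modification_text: str  # natural-language description of the change
    target_image: str       # identifier of the image that should be retrieved


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def compose(image_emb, text_emb):
    """Placeholder fusion: element-wise sum, then normalize.

    Assumption: this is NOT the paper's architecture, only a simple stand-in.
    """
    return normalize([a + b for a, b in zip(image_emb, text_emb)])


def retrieve(query_emb, gallery, k=1):
    """Return the k gallery image names most similar to the composed query."""
    def score(item):
        _, emb = item
        return sum(q * g for q, g in zip(query_emb, normalize(emb)))

    ranked = sorted(gallery.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy example: axis 0 ~ "red", axis 1 ~ "long sleeves", axis 2 ~ "blue".
triplet = CIRTriplet("red_short", "make the sleeves long", "red_long")
gallery = {
    "red_short": [1.0, 0.0, 0.0],
    "red_long": [1.0, 1.0, 0.0],
    "blue_long": [0.0, 1.0, 1.0],
}
text_emb = [0.0, 1.0, 0.0]  # toy embedding of the modification text
query = compose(gallery[triplet.reference_image], text_emb)
top = retrieve(query, gallery, k=2)
```

In this toy gallery the composed query ranks the target `red_long` first, ahead of the unmodified reference image. In a real system the embeddings would come from trained image and text encoders rather than hand-written vectors.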
