CV · Mar 12

Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

arXiv:2602.00813 · 62.9 · h-index: 7
AI Analysis

This work addresses the challenge of retrieving images from implicit mental constructs, with applications such as search and recommendation. The approach is novel, but it is an incremental improvement over existing zero-shot CIR methods.

The paper tackles Composed Image Retrieval (CIR) by using a Large Multimodal Model to generate the "mental image" directly from a multimodal query, rather than a textual description of it. It then matches this generated image against synthetic counterparts of the database images to bridge the synthetic-to-real domain gap. As a training-free zero-shot method, it achieves state-of-the-art performance on challenging benchmarks.

Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods: it uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Specifically, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a "paracosm", in which it matches the multimodal query against the database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
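
The pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not say which image encoder or similarity measure is used, so cosine similarity over image embeddings is an assumption here. The callables `generate_mental_image`, `generate_synthetic_counterpart`, and `embed_image` are hypothetical placeholders for the LMM and encoder calls, and the toy stand-ins below treat "images" as small vectors only so the control flow can be run end to end.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def rank_database(reference_image, modification_text, database_images,
                  generate_mental_image, generate_synthetic_counterpart,
                  embed_image):
    """Rank database images for one composed query.

    All three callables are hypothetical stand-ins supplied by the caller;
    the similarity measure (cosine) is an assumption, not taken from the paper.
    """
    # Step 1: turn the multimodal query into a generated "mental image".
    mental_image = generate_mental_image(reference_image, modification_text)

    # Step 2: generate a synthetic counterpart for every database image,
    # so both sides of the comparison live in the synthetic domain.
    synthetic_db = [generate_synthetic_counterpart(img) for img in database_images]

    # Step 3: embed both sides and rank by cosine similarity (assumed).
    query_vec = l2_normalize(embed_image(mental_image))
    db_vecs = l2_normalize(np.stack([embed_image(s) for s in synthetic_db]))
    scores = db_vecs @ query_vec
    ranking = np.argsort(-scores)  # best match first
    return ranking, scores


# Toy stand-ins: an "image" is a 3-d vector and a text edit adds a vector.
TOY_EDITS = {"add stripes": np.array([0.0, 1.0, 0.0])}


def toy_generate_mental_image(image, text):
    return np.asarray(image, dtype=np.float64) + TOY_EDITS[text]


def toy_generate_synthetic_counterpart(image):
    return np.asarray(image, dtype=np.float64)


def toy_embed_image(image):
    return np.asarray(image, dtype=np.float64)


toy_database = [
    np.array([1.0, 0.0, 0.0]),  # same as the reference image
    np.array([1.0, 1.0, 0.0]),  # reference with the edit applied (target)
    np.array([0.0, 0.0, 1.0]),  # unrelated image
]

ranking, scores = rank_database(
    np.array([1.0, 0.0, 0.0]), "add stripes", toy_database,
    toy_generate_mental_image, toy_generate_synthetic_counterpart,
    toy_embed_image,
)
print(ranking.tolist())
```

In a real system the synthetic counterparts of the database would presumably be generated once and cached, since they do not depend on the query; the sketch regenerates them per call only to stay short.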
