CV MM Mar 25, 2025

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie

arXiv:2503.19296v1 25.243 citations h-index: 13 Has Code SIGIR

Originality: Incremental advance

AI Analysis

This work addresses the challenge of composed image retrieval without annotated training triplets, which matters for applications requiring flexible image search, though it appears incremental over prior ZS-CIR methods.

The paper tackles the problem of zero-shot composed image retrieval (ZS-CIR) by proposing a fine-grained textual inversion network (FTI4CIR) that maps images into multiple pseudo-word tokens instead of a single token, achieving superior performance on three benchmark datasets.

Composed Image Retrieval (CIR) allows users to search target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which targets fulfilling CIR without annotated triplets. The pioneer ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter works on jointly aligning the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
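The query-composition idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyFineGrainedInversion`, the mean-pooling `toy_text_encoder`, the random untrained weights, and the token ordering in `compose_query` are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. It only shows the data flow: one image feature becomes one subject-oriented token plus several attribute-oriented tokens, these are combined with the modification-text tokens, and the resulting text embedding is ranked against gallery image embeddings.

```python
import math
import random


def random_matrix(rng, d_in, d_out):
    """Random d_out x d_in matrix standing in for a trained mapping layer."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ToyFineGrainedInversion:
    """Hypothetical mapper: image feature -> 1 subject token + k attribute tokens.

    Weights are random and untrained; a real system would learn them.
    """

    def __init__(self, dim, num_attributes, seed=0):
        rng = random.Random(seed)
        self.subject_map = random_matrix(rng, dim, dim)
        self.attribute_maps = [random_matrix(rng, dim, dim) for _ in range(num_attributes)]

    def __call__(self, image_feature):
        subject = matvec(self.subject_map, image_feature)
        attributes = [matvec(w, image_feature) for w in self.attribute_maps]
        return subject, attributes


def compose_query(subject, attributes, modification_tokens):
    """Assumed ordering: subject token, attribute tokens, then modification text tokens."""
    return [subject] + list(attributes) + list(modification_tokens)


def toy_text_encoder(token_embeddings):
    """Stand-in for a frozen text encoder: mean-pools the token embeddings."""
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]


def retrieve(query_embedding, gallery):
    """Return gallery names sorted by descending cosine similarity to the query."""
    return sorted(gallery, key=lambda name: cosine(query_embedding, gallery[name]), reverse=True)


if __name__ == "__main__":
    inversion = ToyFineGrainedInversion(dim=4, num_attributes=3, seed=0)
    subject, attributes = inversion([1.0, 0.0, 0.5, -0.5])   # toy reference-image feature
    modification = [[0.1, 0.2, 0.3, 0.4]]                     # toy modification-text token
    query = toy_text_encoder(compose_query(subject, attributes, modification))
    gallery = {"img_a": [0.3, 0.1, -0.2, 0.4], "img_b": [-0.5, 0.2, 0.1, 0.0]}
    print(retrieve(query, gallery))
```

In this sketch the only difference from single-token inversion is that `ToyFineGrainedInversion` returns several tokens instead of one; the retrieval step stays an ordinary text-to-image similarity ranking, which matches the abstract's framing of ZS-CIR as a text-to-image retrieval problem.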
