CVIROct 31, 2024

Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval

arXiv:2410.23736v21 citationsh-index: 8
Originality Incremental advance
AI Analysis

This work improves retrieval accuracy for vision-language tasks, though it appears incremental as it builds on existing inversion-based methods.

The paper tackles the problem of zero-shot composed image retrieval by addressing task and modality discrepancies in inversion-based methods, achieving state-of-the-art performance across four benchmarks.

As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that effectively represent the input semantics. However, the inversion-based methods suffer from two inherent issues: First, the task discrepancy exists because inversion training and CIR inference involve different objectives. Second, the modality discrepancy arises from the input feature distribution mismatch between training and inference. To this end, we propose a lightweight post-hoc framework, consisting of two components: (1) A new text-anchored triplet construction pipeline leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet. (2) The MoTa-Adapter, a novel parameter-efficient fine-tuning method, adapts the dual encoder to the CIR task using our constructed triplet data. Specifically, on the text side, multiple sets of learnable task prompts are integrated via a Mixture-of-Experts (MoE) layer to capture task-specific priors and handle different types of modifications. On the image side, MoTa-Adapter modulates the inversion network's input to better match the downstream text encoder. In addition, an entropy-based optimization strategy is proposed to assign greater weight to challenging samples, thus ensuring efficient adaptation. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements, reaching state-of-the-art performance across four widely-used benchmarks. All data and code will be made publicly available.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes