CV | Dec 1, 2025

Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

arXiv:2512.01636v1 | h-index: 4
Originality: Highly original
AI Analysis

The paper addresses the need for fine-grained visual search without costly triplet annotations, offering a data-efficient solution for image retrieval applications.

The paper tackles the problem of zero-shot composed image retrieval by proposing Fusion-Diff, a generative editing framework that bridges the vision-language modality gap, achieving state-of-the-art performance on benchmarks like CIRR, FashionIQ, and CIRCO with only 200K synthetic samples.

Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
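To make the task setup concrete, here is a minimal sketch of the retrieval step in composed image retrieval: a reference-image embedding and a modification-text embedding are combined into one query, and gallery images are ranked by cosine similarity. This is not Fusion-Diff. The weighted-average fusion in `compose_query` is a stand-in assumption for whatever editing the model performs in the joint VL space, and the function names, the `alpha` parameter, and the toy three-dimensional embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose_query(image_emb, text_emb, alpha=0.5):
    """Fuse reference-image and modification-text embeddings.

    ASSUMPTION: a plain weighted average stands in for the paper's
    generative feature editing, which is not specified here.
    """
    img = normalize(image_emb)
    txt = normalize(text_emb)
    return normalize([(1 - alpha) * i + alpha * t for i, t in zip(img, txt)])


def retrieve(query, gallery, k=3):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical): axis 0 = reference content, axis 1 = requested change.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "target": [1.0, 1.0, 0.0],          # reference content plus the change
    "reference_like": [1.0, 0.0, 0.0],  # unchanged reference content
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose_query(reference_image, modification_text)
top = retrieve(query, gallery, k=3)
```

In this toy gallery the item that combines the reference content with the requested change ranks first. In a real zero-shot pipeline the embeddings would come from a pretrained vision-language encoder, and the quality of the fusion step is exactly where methods differ.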

