CV · AI · Aug 15, 2025

Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering

arXiv:2508.11272v1 · h-index: 2
Originality: Incremental advance
AI Analysis

This work aims to improve retrieval accuracy in supervised composed image retrieval, a computer vision task, and is assessed as an incremental advance over existing methods.

The paper proposes PMTFR, a framework for supervised composed image retrieval that combines a Pyramid Matching Model with Training-Free Refinement. A Pyramid Patcher module improves visual understanding at multiple granularities, and representations extracted from Chain-of-Thought data are injected into large vision-language models (LVLMs) to refine retrieval scores. The authors report state-of-the-art results on CIR benchmarks.

Composed Image Retrieval (CIR) is challenging because it requires jointly understanding a reference image and a textual modification instruction in order to find relevant target images. Some existing methods use a two-stage approach to further refine retrieval results, but this often requires training an additional ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application to CIR remains limited: existing approaches either compress visual information into text or rely on elaborate prompt designs. Moreover, existing works apply CoT only to zero-shot CIR, because it is difficult to achieve satisfactory gains in supervised CIR on top of a well-trained model. In this work, we propose the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. A simple but effective module called the Pyramid Patcher enhances the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into LVLMs. This allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
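
The abstract does not say how representations are extracted from CoT data or how they are injected. As a rough illustration of the general idea behind representation engineering, the sketch below uses a common difference-of-means recipe: average the hidden states collected with and without CoT, take the normalized difference as a direction, and add it to hidden states at inference time. This is an assumption for illustration, not the paper's implementation; the function names (`steering_vector`, `inject`), the strength `alpha`, and the toy vectors are all hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(cot_hidden, plain_hidden):
    """Unit-length difference between mean CoT and mean plain hidden states.

    Assumption: a difference-of-means direction stands in for whatever
    extraction procedure the paper actually uses.
    """
    mu_cot = mean_vector(cot_hidden)
    mu_plain = mean_vector(plain_hidden)
    diff = [a - b for a, b in zip(mu_cot, mu_plain)]
    norm = sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff] if norm > 0 else diff


def inject(hidden, direction, alpha=1.0):
    """Shift each hidden state along the direction with strength alpha."""
    return [[h + alpha * v for h, v in zip(row, direction)] for row in hidden]


# Toy hidden states (hypothetical): the CoT states differ from the plain
# ones only by a shift of 3.0 along the first dimension.
plain = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
cot = [[3.0, 1.0, 2.0], [5.0, 1.0, 0.0]]

direction = steering_vector(cot, plain)
steered = inject(plain, direction, alpha=3.0)
```

In a real LVLM the hidden states would come from a chosen transformer layer and the shift would be applied during the forward pass; which layer and which strength to use are choices the abstract does not specify.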

