CV · Dec 26, 2025

Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

arXiv:2512.21860v1 · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of extracting condition-specific image features for applications such as image retrieval and analysis. The advance is incremental, as it builds on existing vision-language models.

The paper tackles the problem of generating conditional image embeddings, which focus on aspects of an image specified by a textual condition (e.g., color, genre). It proposes DIOR, a training-free method that uses a large vision-language model to produce these embeddings. In experiments, DIOR outperforms existing training-free baselines such as CLIP and, in multiple settings, even methods that require additional training.

Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). Producing them has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
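
The pipeline described in the abstract (prompt the LVLM for a one-word description tied to the condition, then take the last token's hidden state) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the prompt wording in `build_prompt` is an assumption, since the exact template is not given here, and `encode` is a hypothetical callable standing in for an LVLM forward pass that returns one final-layer hidden-state vector per input token.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = List[float]
# Hypothetical stand-in for an LVLM forward pass:
# (image, prompt) -> one hidden-state vector per input token.
Encoder = Callable[[Any, str], Sequence[Sequence[float]]]


def build_prompt(condition: str) -> str:
    """Ask for a one-word description tied to the condition.

    The wording is an assumption; the summary above only says the model is
    prompted to describe the image with a single word related to the condition.
    """
    return f"Describe the {condition} of this image in one word:"


def last_token_embedding(hidden_states: Sequence[Sequence[float]]) -> Vector:
    """Return the hidden state at the final token position."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")
    return list(hidden_states[-1])


def conditional_embedding(image: Any, condition: str, encode: Encoder) -> Vector:
    """Training-free conditional embedding: prompt, encode, take last token."""
    hidden_states = encode(image, build_prompt(condition))
    return last_token_embedding(hidden_states)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Compare two embeddings, e.g. for conditional image similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

With a real model, `encode` would wrap the LVLM's processor and forward pass and return the last layer's per-token hidden states. Two images can then be compared under the same condition by calling `conditional_embedding` on each and passing the results to `cosine_similarity`.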
