CV · Aug 29, 2025

Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

arXiv:2509.00177v1 · 1 citation · h-index: 28 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific problem in computer vision for researchers and practitioners, offering an incremental improvement by combining existing models.

The paper tackles the problem of text-to-image retrieval for category-level queries by bridging the modality gap between text and images. It proposes a method using diffusion models and vision encoders, showing consistent performance improvements over text-only retrieval methods.

This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
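The scoring side of the two-step approach described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes all embeddings are already computed (the text query and gallery by a VLM such as CLIP, the generated images and gallery by a vision encoder), it substitutes simple mean pooling for the learned aggregation network, and it fuses the two similarity scores with a fixed weight `alpha`. The function names and `alpha` are hypothetical, and the diffusion-based image generation step is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def aggregate_generated(gen_embs):
    """Collapse the embeddings of several generated images into one query vector.

    Assumption: mean pooling stands in for the paper's learned
    aggregation network.
    """
    return l2_normalize(l2_normalize(gen_embs).mean(axis=0))


def fused_scores(text_emb, gen_embs, gallery_vlm, gallery_vis, alpha=0.5):
    """Combine text-to-image and image-to-image cosine similarities.

    text_emb:    (d_vlm,)   VLM embedding of the text query
    gen_embs:    (m, d_vis) vision-encoder embeddings of m generated images
    gallery_vlm: (n, d_vlm) VLM image embeddings of the gallery
    gallery_vis: (n, d_vis) vision-encoder embeddings of the gallery
    alpha:       hypothetical fixed fusion weight (the paper learns the fusion)
    """
    text_scores = l2_normalize(gallery_vlm) @ l2_normalize(text_emb)
    image_scores = l2_normalize(gallery_vis) @ aggregate_generated(gen_embs)
    return alpha * text_scores + (1.0 - alpha) * image_scores


def rank_gallery(scores, k=5):
    """Return the indices of the k highest-scoring gallery images."""
    return np.argsort(-scores)[:k]
```

Keeping the two gallery embeddings separate reflects the abstract's point that the text query is compared in the VLM space while the generated visual query is compared in the vision encoder's space; only the final scores are combined.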

