CV · Feb 21, 2025

ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

arXiv:2502.15682v3 · 5 citations · h-index: 49 · CBMI
Originality: Incremental advance
AI Analysis

This work targets better zero-shot generalization in text-to-image retrieval, with applications such as search engines. The advance is incremental, since it builds on existing models such as CLIP.

The paper tackles the problem of improving text-to-image retrieval by introducing ELIP, a framework that enhances pre-trained vision-language models for re-ranking, achieving significant performance boosts on benchmarks including new out-of-distribution datasets.
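To make the re-ranking idea concrete, here is a minimal sketch of a generic two-stage pipeline: a cheap first-stage score ranks every image, and a more expensive scorer re-orders only the top-k shortlist. The function name `rerank_top_k`, the `rerank_score` callback and the toy scores are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
from typing import Callable, List, Sequence


def rerank_top_k(
    coarse_scores: Sequence[float],
    rerank_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return all image indices, with the top-k coarse candidates re-ordered.

    coarse_scores[i] is the cheap first-stage similarity between the query
    and image i; rerank_score(i) is a hypothetical, more expensive scorer.
    """
    # Stage 1: rank every image by the cheap score, best first.
    order = sorted(
        range(len(coarse_scores)),
        key=lambda i: coarse_scores[i],
        reverse=True,
    )
    head, tail = order[:k], order[k:]
    # Stage 2: re-score only the shortlisted candidates.
    head = sorted(head, key=rerank_score, reverse=True)
    return head + tail


# Toy example: four images, re-rank the top two.
coarse = [0.2, 0.9, 0.8, 0.1]
fine = {1: 0.3, 2: 0.7}  # made-up second-stage scores for the shortlist
ranking = rerank_top_k(coarse, lambda i: fine[i], k=2)
```

Only the shortlist is passed to the expensive scorer, so its cost scales with k rather than with the size of the gallery.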

The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
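The text-to-visual-prompt mapping described in the abstract can be sketched roughly as follows. This is an illustrative toy in plain Python, not the authors' code: the two-layer shape, the ReLU activation, the dimensions, the random initialisation and the choice to prepend the prompts to the token sequence are all assumptions, and every function name is hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def init_linear(rng: random.Random, in_dim: int, out_dim: int) -> Tuple[Matrix, Vector]:
    """Small random weights and a zero bias for one linear layer."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict_visual_prompts(
    text_emb: Vector,
    layer1: Tuple[Matrix, Vector],
    layer2: Tuple[Matrix, Vector],
    num_prompts: int,
    token_dim: int,
) -> Matrix:
    """Map one text embedding to num_prompts prompt tokens of width token_dim."""
    hidden = [max(0.0, h) for h in linear(text_emb, *layer1)]  # ReLU is an assumption
    flat = linear(hidden, *layer2)  # length num_prompts * token_dim
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_prompts)]


def condition_tokens(patch_tokens: Matrix, prompts: Matrix) -> Matrix:
    """Prepend the prompts to the ViT token sequence (ordering is an assumption)."""
    return prompts + patch_tokens


rng = random.Random(0)
text_dim, hidden_dim, token_dim, num_prompts = 8, 16, 4, 3  # toy sizes
layer1 = init_linear(rng, text_dim, hidden_dim)
layer2 = init_linear(rng, hidden_dim, num_prompts * token_dim)

text_emb = [rng.uniform(-1.0, 1.0) for _ in range(text_dim)]
patch_tokens = [[0.0] * token_dim for _ in range(5)]  # stand-in for ViT patch embeddings

prompts = predict_visual_prompts(text_emb, layer1, layer2, num_prompts, token_dim)
tokens = condition_tokens(patch_tokens, prompts)
```

In this sketch the image encoder would then process `tokens` instead of the bare patch sequence, so the resulting image embedding depends on the query text.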

