CV · Jun 11, 2023

Self-Enhancement Improves Text-Image Retrieval in Foundation Visual-Language Models

arXiv:2306.06691v1 · 3 citations · h-index: 54 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a domain-specific retrieval problem faced by users of cross-modal foundation models, but the advance is incremental because it builds on existing models such as CLIP.

The paper tackles domain-specific text-image retrieval, where foundation models fail to focus on the key attributes. It proposes a self-enhancement framework that improves performance without additional samples and reports a salient improvement over the baseline and other teams' solutions in a challenge track.

The emergence of cross-modal foundation models has introduced numerous approaches grounded in text-image retrieval. However, on some domain-specific retrieval tasks, these models fail to focus on the key attributes required. To address this issue, we propose a self-enhancement framework, A^{3}R, based on the CLIP-ViT/G-14, one of the largest cross-modal models. First, we perform an Attribute Augmentation strategy to enrich the textual description for fine-grained representation before model learning. Then, we propose an Adaption Re-ranking method to unify the representation space of textual query and candidate images and re-rank candidate images relying on the adapted query after model learning. The proposed framework is validated to achieve a salient improvement over the baseline and other teams' solutions in the cross-modal image retrieval track of the 1st foundation model challenge without introducing any additional samples. The code is available at https://github.com/CapricornGuang/A3R.
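
The abstract names two stages, Attribute Augmentation and Adaption Re-ranking, without giving their details. The sketch below is only an illustration of the general idea under stated assumptions, not the authors' implementation: `augment_text` is a hypothetical helper that appends attribute words already present in a caption, and `adapt_and_rerank` assumes the query is "adapted" by blending it with the mean of its top-k retrieved image embeddings (a query-expansion style step chosen here for illustration). The embeddings are plain NumPy arrays standing in for CLIP text and image features, and the names `top_k` and `alpha` are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def augment_text(text, attribute_vocab):
    """Hypothetical attribute augmentation: restate attribute words found in
    the caption as an explicit suffix. Uses naive substring matching."""
    lowered = text.lower()
    found = [attr for attr in attribute_vocab if attr in lowered]
    if not found:
        return text
    return text + " Attributes: " + ", ".join(found) + "."


def adapt_and_rerank(query_emb, image_embs, top_k=2, alpha=0.5):
    """Assumed re-ranking scheme: pull the text query toward the image
    embedding space using its top-k first-pass matches, then re-score."""
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    imgs = l2_normalize(np.asarray(image_embs, dtype=float))

    # First pass: cosine similarity between the text query and every image.
    first_pass = imgs @ q
    top_idx = np.argsort(-first_pass)[:top_k]

    # Adapt the query by mixing in the centroid of the top-k image embeddings.
    centroid = imgs[top_idx].mean(axis=0)
    adapted = l2_normalize((1.0 - alpha) * q + alpha * centroid)

    # Second pass: re-score all candidates against the adapted query.
    scores = imgs @ adapted
    order = np.argsort(-scores)
    return order, scores
```

With real features, `query_emb` and `image_embs` would come from the text and image encoders of the same CLIP model; the blending weight `alpha` controls how far the query moves toward the image side and would need tuning on held-out data.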

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
