CV, CL · Dec 4, 2021

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

arXiv:2112.02399v3 · 18.756 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of adapting pre-trained vision-language models to downstream applications and represents an incremental improvement over existing approaches.

The paper tackles the problem of CLIP's sub-optimal image-text alignment on downstream tasks due to semantic gaps, proposing VT-CLIP to enhance it with visual-guided texts, resulting in improved performance demonstrated on 11 classification datasets in few-shot settings.

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transferring performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
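The abstract describes text features that attend over image regions and aggregate visual features before category-wise matching. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation. It assumes single-head dot-product attention with no learned projections, a residual addition of the aggregated visual feature to the text feature, and cosine similarity for matching. The function names (`visual_guided_texts`, `classify`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_guided_texts(text_feats, patch_feats):
    """Let each class text feature attend over image patch features.

    Assumption: the attended visual feature is added to the text feature
    as a residual. The paper may use learned projections and a different
    fusion; this only illustrates the cross-attention pattern.
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)
    guided = []
    for t in text_feats:
        # Text feature is the query; patch features are keys and values.
        weights = softmax([dot(t, p) * scale for p in patch_feats])
        aggregated = [
            sum(w * p[i] for w, p in zip(weights, patch_feats))
            for i in range(dim)
        ]
        guided.append([ti + ai for ti, ai in zip(t, aggregated)])
    return guided


def classify(image_feat, text_feats):
    """Cosine similarity between one global image feature and each class text."""
    img = normalize(image_feat)
    return [dot(img, normalize(t)) for t in text_feats]


# Toy example: two classes, two image patches, 2-D features (made-up values).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[0.9, 0.1], [0.8, 0.3]]
image_feat = [0.85, 0.2]  # mean of the two patches, standing in for a global feature

guided = visual_guided_texts(text_feats, patch_feats)
logits = classify(image_feat, guided)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])
```

In this toy setup the image patches lie mostly along the first class's text direction, so the first class receives the higher similarity. In a real few-shot pipeline the features would come from CLIP's image and text encoders, and any attention parameters would be trained on the downstream data.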
