CV · Jul 19, 2024

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

arXiv:2407.14117v1 · 5 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in low-shot CLIP adaptation for vision-language tasks, offering incremental improvements.

The paper targets the biased perception of local details in low-shot CLIP adaptation. It proposes Visual Content Refinement (VCR), a test-time method that decomposes each image into multiple scales, selects the view with the maximum prediction margin at each scale, and merges the selected views into a robust representation. On few-shot classification, VCR improves on the Tip-Adapter baseline by about 2% on average.

Recent adaptation methods can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these methods usually operate on the global view of an input image and thus have a biased perception of its local details. To solve this problem, we propose Visual Content Refinement (VCR), applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin at each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new robust representation. The merged content can thus be used directly to help the adapter focus on both global and local parts without any extra training parameters. We apply our method to 3 popular low-shot benchmark tasks with 13 datasets and achieve a significant improvement over state-of-the-art methods. For example, compared to the baseline (Tip-Adapter) on the few-shot classification task, our method achieves about 2% average improvement in both the training-free and training-need settings.
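The three steps in the abstract (decompose, select by margin, merge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the margin is taken to be the top-1 minus top-2 class probability, views are random crops, merging is a per-scale weighted sum of features, and `toy_encode`, `crop_views`, `refine`, the scale values, and the logit scale of 100 are all hypothetical stand-ins. A real pipeline would replace `toy_encode` and `text_feats` with the pre-trained CLIP image and text encoders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_margin(probs):
    # Assumed definition: top-1 probability minus top-2 probability per row.
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]

def crop_views(image, scale, n_views, rng):
    # Random crops whose sides are `scale` times the image sides.
    h, w = image.shape[:2]
    ch, cw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    views = []
    for _ in range(n_views):
        top = rng.integers(0, h - ch + 1)
        left = rng.integers(0, w - cw + 1)
        views.append(image[top:top + ch, left:left + cw])
    return views

def refine(image, encode, text_feats, scales=(1.0, 0.7, 0.4),
           n_views=8, weights=None, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.ones(len(scales)) if weights is None else np.asarray(weights, float)
    selected = []
    for s in scales:
        feats = np.stack([encode(v) for v in crop_views(image, s, n_views, rng)])
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        probs = softmax(100.0 * feats @ text_feats.T)  # zero-shot class probabilities
        # Keep only the most confident (largest-margin) view at this scale.
        selected.append(feats[np.argmax(prediction_margin(probs))])
    # Assumed merge rule: weighted sum over scales, then re-normalise.
    merged = (weights[:, None] * np.stack(selected)).sum(axis=0)
    return merged / np.linalg.norm(merged)

def toy_encode(view):
    # Hypothetical stand-in for the CLIP image encoder (4-d feature).
    return np.array([view.mean(), view.std(), view.max(), view.min()])

demo_rng = np.random.default_rng(1)
image = demo_rng.random((32, 32, 3))
text_feats = demo_rng.normal(size=(3, 4))  # stand-in for 3 class text embeddings
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
merged = refine(image, toy_encode, text_feats)
```

The refined feature `merged` would then take the place of the single global-view feature as input to an adapter such as Tip-Adapter, which is why no extra trainable parameters are needed in this sketch.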

Code Implementations: 1 repo
