CV · Jul 28, 2025

Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

Alexey Kravets, Da Chen, Vinay P. Namboodiri

arXiv:2507.20834v1 · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses a critical flaw in how few-shot CLIP classification is evaluated, giving researchers more realistic benchmarks and an improved method, though its contribution is incremental in that it refines existing techniques.

The authors show that existing few-shot CLIP benchmarks are partially transductive because most of the benchmark datasets have already been seen by the CLIP model. Using an unlearning technique to create true inductive baselines, they report an average performance drop of 55% across 13 baselines. They also propose an improved method that achieves state-of-the-art results across 5880 experiments.

CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far, all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed as partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, the methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. An improved few-shot classification technique is proposed that consistently obtains state-of-the-art performance over 13 other recent baseline methods on a comprehensive analysis with 5880 experiments - varying the datasets, differing number of few-shot examples, unlearning setting, and with different seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method.
