CV | AI | Apr 8, 2024

Transductive Zero-Shot and Few-Shot CLIP

arXiv:2405.18437v1 | 41 citations | h-index: 51 | Has Code | CVPR
Originality: Highly original
AI Analysis

This work addresses a bottleneck in adapting vision-language models like CLIP for practical scenarios where unlabeled data is available in batches, offering significant accuracy gains for tasks like image classification.

The paper tackles transductive zero-shot and few-shot classification with CLIP by performing joint inference across batches of unlabeled queries. On zero-shot tasks with test batches of 75 samples, it reports a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance, and it outperforms state-of-the-art methods in few-shot settings.

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.
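
To make the abstract's first two ingredients concrete, here is a minimal sketch, not the authors' implementation. It assumes the vision-text probability features are a temperature-scaled softmax over image-text cosine similarities, and it scores each feature under one Dirichlet density per class. The function names, the `temperature` value, and the fixed Dirichlet parameters are all illustrative assumptions. The paper's block Majorization-Minimization algorithm, which estimates parameters and assignments jointly, is not reproduced here; only a single assignment pass with given parameters is shown.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; the output lies on the unit simplex."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def probability_features(image_embs, text_embs, temperature=0.01):
    """Map each image embedding to a probability vector over classes.

    Assumption: softmax over cosine similarities with one text embedding
    per class. The temperature default is illustrative, not from the paper.
    """
    return [
        softmax([cosine(img, txt) for txt in text_embs], temperature)
        for img in image_embs
    ]


def dirichlet_log_pdf(x, alpha, eps=1e-12):
    """Log-density of a Dirichlet(alpha) at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # Clamp coordinates away from zero so the logarithm stays finite.
    return log_norm + sum(
        (a - 1.0) * math.log(max(xi, eps)) for a, xi in zip(alpha, x)
    )


def soft_assignments(features, alphas, class_weights):
    """One EM-style assignment pass with FIXED Dirichlet parameters.

    Returns, for each sample, a posterior over classes proportional to
    class_weight * Dirichlet density. Parameter updates are omitted.
    """
    posteriors = []
    for x in features:
        logits = [
            math.log(w) + dirichlet_log_pdf(x, a)
            for a, w in zip(alphas, class_weights)
        ]
        posteriors.append(softmax(logits))
    return posteriors
```

In use, `probability_features` would receive a batch of CLIP image embeddings and one text embedding per class, and `soft_assignments` would then be called with one Dirichlet parameter vector per class. A full transductive method would alternate such assignment passes with parameter re-estimation over the whole query batch.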

