CV · Oct 29, 2024

Active Learning for Vision-Language Models

arXiv:2410.22187v1 · 17 citations · h-index: 6 · WACV
Originality: Incremental advance
AI Analysis

This work aims to improve the zero-shot performance of vision-language models in computer vision. It is rated an incremental advance in active learning methods.

The paper tackles the performance gap between pre-trained vision-language models and supervised models. It proposes an active learning framework that selects informative samples for annotation, which enhances zero-shot classification performance on image datasets.

Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
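The selection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the calibration rule or how the two uncertainty terms are combined, so everything specific here is an assumption. In particular, calibration is approximated by dividing out the average predicted class prior, neighbor-aware uncertainty is taken as the mean entropy of the `k` nearest neighbors in feature space, and the two terms are mixed with a hypothetical weight `alpha`. The function and parameter names (`select_for_annotation`, `temperature`, `k`, `alpha`) are illustrative only.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with a temperature, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy(probs):
    """Shannon entropy of each row of a probability matrix."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)


def select_for_annotation(image_feats, text_feats, budget,
                          k=5, temperature=0.01, alpha=0.5):
    """Pick `budget` unlabeled samples with the highest combined uncertainty.

    image_feats: (n, d) image embeddings from a VLM such as CLIP.
    text_feats:  (c, d) class-prompt text embeddings.
    All scoring choices below are assumptions made for illustration.
    """
    # Cosine-similarity logits between images and class prompts.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    probs = softmax(img @ txt.T, temperature)

    # ASSUMED calibration: remove the average class prior, then renormalize.
    prior = probs.mean(axis=0) + 1e-12
    calibrated = probs / prior
    calibrated = calibrated / calibrated.sum(axis=1, keepdims=True)

    # Self-uncertainty: entropy of each sample's calibrated prediction.
    self_unc = entropy(calibrated)

    # ASSUMED neighbor-aware uncertainty: mean entropy of the k most
    # similar other samples in image-feature space.
    sim = img @ img.T
    np.fill_diagonal(sim, -np.inf)  # exclude each sample from its own neighbors
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    neighbor_unc = self_unc[neighbors].mean(axis=1)

    # ASSUMED combination: convex mix of the two uncertainty terms.
    scores = alpha * self_unc + (1.0 - alpha) * neighbor_unc
    selected = np.argsort(-scores)[:budget]
    return selected, scores
```

The returned indices would be sent for annotation, after which the model is adapted on the newly labeled samples and the procedure repeats for the next round. A sample surrounded by uncertain neighbors gets a higher score than an isolated uncertain one, which is the intuition behind mixing the two terms.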

