CV | AI | May 23, 2024

Pre-Trained Vision-Language Models as Partial Annotators

arXiv:2406.18550v1 | 4 citations | h-index: 11 | Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of reducing annotation costs for adapting pre-trained models in image classification, offering a practical solution for applications where labeled data is scarce but unlabeled data is abundant.

The paper tackles the problem of adapting pre-trained vision-language models to downstream tasks without extensive manual annotation. It proposes a 'pre-trained annotating - weakly-supervised learning' paradigm that uses CLIP to generate noisy partial labels and a collaborative consistency regularization algorithm to train on them. The method achieves performance far beyond zero-shot inference, outperforms other weakly supervised and few-shot fine-tuning methods, and yields smaller deployed models.

Pre-trained vision-language models learn from massive data to model unified representations of images and natural language, which can be widely applied to downstream machine learning tasks. Beyond zero-shot inference, pre-trained models are usually adapted to downstream tasks with methods such as few-shot or parameter-efficient fine-tuning and knowledge distillation. However, annotating samples is laborious, while large numbers of unlabeled samples are easy to obtain. In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for applying pre-trained models and experiment on image classification tasks. Specifically, based on CLIP, we annotate image samples with multiple prompt templates to obtain multiple candidate labels per sample, forming a noisy partial label dataset, and design a collaborative consistency regularization algorithm to learn from it. Our method simultaneously trains two neural networks, which collaboratively purify training labels for each other and obtain pseudo-labels for self-training, while adopting prototypical similarity alignment and noisy supervised contrastive learning to optimize model representation. In experiments, our method achieves performance far beyond zero-shot inference without introducing additional label information, outperforms other weakly supervised learning and few-shot fine-tuning methods, and yields smaller deployed models. Our code is available at https://anonymous.4open.science/r/Co-Reg-8CF9
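
The annotation step described in the abstract can be illustrated with a minimal sketch. It assumes that each prompt template contributes its top-1 class and that the candidate label set of an image is the union of those predictions; the abstract does not state the exact rule, so this is an assumption, not the authors' implementation. The function name `partial_labels_from_templates` and the similarity scores are hypothetical, and the scores stand in for CLIP image-text similarities, which are not computed here.

```python
import numpy as np


def partial_labels_from_templates(scores: np.ndarray) -> np.ndarray:
    """Build candidate label sets from per-template similarity scores.

    scores has shape (n_templates, n_images, n_classes). In a real pipeline
    these would be image-text similarities from a model such as CLIP; here
    they are made-up numbers.

    Returns a boolean mask of shape (n_images, n_classes) where True marks a
    class that at least one template ranked first for that image.
    ASSUMPTION: union of top-1 predictions across templates.
    """
    n_templates, n_images, n_classes = scores.shape
    top1 = scores.argmax(axis=2)  # (n_templates, n_images)
    candidates = np.zeros((n_images, n_classes), dtype=bool)
    # Pair every top-1 class with the index of the image it belongs to.
    image_idx = np.broadcast_to(np.arange(n_images), top1.shape)
    candidates[image_idx, top1] = True
    return candidates


# Hypothetical scores: 3 prompt templates, 2 images, 3 classes.
scores = np.array([
    [[0.90, 0.05, 0.05], [0.20, 0.50, 0.30]],  # template 1
    [[0.60, 0.30, 0.10], [0.10, 0.30, 0.60]],  # template 2
    [[0.40, 0.50, 0.10], [0.20, 0.60, 0.20]],  # template 3
])
candidates = partial_labels_from_templates(scores)
print(candidates)
```

In this toy input the templates disagree on both images, so each image receives two candidate labels. The resulting sets are "noisy" in the sense used above: nothing guarantees that the true class is among the candidates, which is why a downstream algorithm has to purify them.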
