CV CL, Aug 22, 2023

Unsupervised Prototype Adapter for Vision-Language Models

Yi Zhang, Ce Zhang, Xueting Hu, Zhihai He

ETH Zurich

arXiv:2308.11507v28.410 citations, h-index: 59

Originality: Incremental advance

AI Analysis

This work addresses a scalability limitation in adapting vision-language models to downstream tasks by reducing reliance on annotated data. The advance is incremental, since it builds on existing fine-tuning approaches.

The paper tackles fine-tuning of vision-language models without annotated data. It proposes an unsupervised prototype adapter that automatically selects confident samples and generates class prototypes, and it reports gains over 8-shot CoOp, 8-shot Tip-Adapter, and UPL on image recognition and domain generalization tasks.

Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class. Utilizing these selected samples, we generate class prototypes, which serve as the initialization for the learnable prototype model. After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks. Our extensive experimental results on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and also the state-of-the-art UPL method by large margins.
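The pipeline described in the abstract (confident-sample selection, prototype construction, residual combination with CLIP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the image and text features have already been extracted by CLIP and L2-normalized, ranks samples per class by zero-shot probability, and uses a plain scaled cosine similarity for the prototype branch. The function names and the `top_k`, `scale`, and `alpha` parameters are hypothetical, and the fine-tuning of the prototypes is omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def zero_shot_logits(image_feats, text_feats, scale=100.0):
    """CLIP-style logits: scaled cosine similarity of unit-norm features."""
    return scale * image_feats @ text_feats.T


def build_prototypes(image_feats, text_feats, top_k=4):
    """Average the top_k most confident unlabeled images for each class.

    Confidence is the zero-shot probability of that class (an assumption
    made for this sketch). Returns one unit-norm prototype per class.
    """
    probs = softmax(zero_shot_logits(image_feats, text_feats))
    prototypes = np.zeros_like(text_feats)
    for c in range(text_feats.shape[0]):
        top = np.argsort(probs[:, c])[::-1][:top_k]  # most confident first
        prototypes[c] = image_feats[top].mean(axis=0)
    return l2_normalize(prototypes)


def combined_logits(image_feats, text_feats, prototypes, alpha=1.0, scale=100.0):
    """Residual combination: zero-shot logits plus weighted prototype logits."""
    proto_logits = scale * image_feats @ prototypes.T
    return zero_shot_logits(image_feats, text_feats, scale) + alpha * proto_logits


if __name__ == "__main__":
    # Random stand-ins for CLIP features: 3 classes, 30 unlabeled images.
    rng = np.random.default_rng(0)
    text = l2_normalize(rng.normal(size=(3, 16)))
    images = l2_normalize(rng.normal(size=(30, 16)))
    protos = build_prototypes(images, text, top_k=4)
    predictions = combined_logits(images, text, protos).argmax(axis=1)
    print(protos.shape, predictions.shape)
```

In the method as described, the prototypes would serve only as the initialization of a learnable model that is then fine-tuned; here they are used directly, so the sketch shows the data flow rather than the reported performance.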
