CV · LG · Sep 26, 2023

Noise-Tolerant Few-Shot Unsupervised Adapter for Vision-Language Models

arXiv:2309.14928v3 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses scalability and generalizability limits in visual recognition by enabling few-shot adaptation without labels. The advance is incremental because it builds on existing adapter methods.

Few-shot adaptation of vision-language models normally requires labelled target samples. The paper removes that requirement by proposing NtUA, a noise-tolerant unsupervised adapter that learns from a few unlabelled samples and reports superior performance across multiple benchmarks.

Recent advances in large-scale vision-language models have achieved impressive performance in various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability and generalizability when handling various visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that allows the learning of effective target models with few unlabelled target samples. NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation, which combats pseudo-label noise by weighting the key-value pairs according to their prediction confidence. The second is knowledge-guided cache refinement, which refines pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision-language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
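
To make the confidence-weighted key-value cache concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives no formulas, so the exponential affinity (borrowed from Tip-Adapter-style caches), the function names, the toy 2-D features, and the values of `scale`, `alpha`, and `beta` are all illustrative assumptions. Knowledge-guided cache refinement is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_logits(feature, text_weights, scale=10.0):
    # Scaled cosine similarity between an image feature and each class
    # text embedding. `scale` is an illustrative temperature (assumption).
    return [scale * dot(feature, w) for w in text_weights]

def build_cache(features, text_weights):
    # Adaptive cache formation: each unlabelled feature becomes a key, its
    # one-hot pseudo-label becomes the value, and the zero-shot prediction
    # confidence becomes the weight of that key-value pair.
    cache = []
    for f in features:
        probs = softmax(zero_shot_logits(f, text_weights))
        confidence = max(probs)
        label = probs.index(confidence)
        one_hot = [1.0 if c == label else 0.0 for c in range(len(probs))]
        cache.append((f, one_hot, confidence))
    return cache

def cache_logits(query, cache, beta=5.5):
    # Assumed Tip-Adapter-style affinity: exp(-beta * (1 - cosine similarity)).
    n_classes = len(cache[0][1])
    out = [0.0] * n_classes
    for key, value, weight in cache:
        affinity = math.exp(-beta * (1.0 - dot(query, key)))
        for c in range(n_classes):
            out[c] += weight * affinity * value[c]
    return out

def predict(query, cache, text_weights, alpha=1.0):
    # Final logits: zero-shot logits plus the weighted cache contribution.
    zs = zero_shot_logits(query, text_weights)
    cl = cache_logits(query, cache)
    logits = [z + alpha * c for z, c in zip(zs, cl)]
    return logits.index(max(logits))

# Toy data standing in for real image and text embeddings (hypothetical).
text_weights = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
unlabelled = [l2_normalize(v) for v in ([0.9, 0.1], [0.2, 0.8], [0.55, 0.45])]
cache = build_cache(unlabelled, text_weights)
for key, value, weight in cache:
    print(value, round(weight, 3))
print(predict(l2_normalize([0.8, 0.2]), cache, text_weights))
```

In this toy setup the ambiguous third sample receives a visibly lower weight than the two clear-cut ones, so a possibly wrong pseudo-label contributes less to the cache logits. That is the intuition behind weighting pairs by prediction confidence.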

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
