Noise-Tolerant Few-Shot Unsupervised Adapter for Vision-Language Models
The work addresses the scalability and generalizability limits of few-shot adaptation in visual recognition by removing the need for labels, though the contribution is incremental because it builds on existing adapter methods.
The paper tackles the need for labelled target samples in few-shot adaptation of vision-language models by proposing NtUA, a noise-tolerant unsupervised adapter that learns from a few unlabelled samples and reports superior performance across multiple benchmarks.
Recent advances in large-scale vision-language models have achieved impressive performance in various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability and generalizability when handling various visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that allows the learning of effective target models with a few unlabelled target samples. NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation, which combats pseudo-label noise by weighting the key-value pairs according to their prediction confidence. The second is knowledge-guided cache refinement, which refines pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision-language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
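
The cache formation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the softmax temperature, the exponential affinity kernel, and the `alpha` and `beta` hyperparameters are all assumptions chosen for illustration, and the knowledge-guided refinement step (distillation) is omitted. The sketch only shows the idea of storing visual features as keys and confidence-weighted pseudo-labels as values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_cache(image_feats, text_feats, temperature=100.0):
    """Form a confidence-weighted key-value cache from unlabelled features.

    image_feats: (N, D) features of the few unlabelled target samples.
    text_feats:  (C, D) class text embeddings from the vision-language model.
    The temperature value is an assumption, not taken from the paper.
    """
    keys = l2_normalize(image_feats)
    probs = softmax(temperature * keys @ l2_normalize(text_feats).T)
    pseudo_labels = probs.argmax(axis=1)   # predicted pseudo-labels
    confidence = probs.max(axis=1)         # prediction confidence per sample
    one_hot = np.eye(text_feats.shape[0])[pseudo_labels]
    values = one_hot * confidence[:, None] # down-weight uncertain pairs
    return keys, values, confidence

def cache_logits(query_feats, keys, values, text_feats, alpha=1.0, beta=5.5):
    """Combine zero-shot logits with cache predictions.

    The exponential affinity and the alpha/beta values are assumptions
    borrowed from common cache-based adapter designs.
    """
    q = l2_normalize(query_feats)
    affinity = np.exp(-beta * (1.0 - q @ keys.T))   # (M, N) similarity to keys
    zero_shot = 100.0 * q @ l2_normalize(text_feats).T
    return zero_shot + alpha * affinity @ values
```

In this sketch, a sample whose pseudo-label is predicted with low confidence contributes a proportionally smaller value vector, so a noisy pseudo-label has less influence on the final logits than a confident one.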