CV | Sep 8, 2023

Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment

arXiv:2309.04158v1 | 5 citations | h-index: 8
Originality: Highly original
AI Analysis

This work addresses the challenge of efficient and interpretable prompt tuning for vision-language models in few-shot learning scenarios, offering a novel method that combines explicit and implicit context modeling.

The paper tackles the problem of adapting vision-language models to downstream tasks with few training samples by introducing Dual-Aligned Prompt Tuning (DuAl-PT), which incorporates pre-trained large language models to generate context descriptions and aligns prompts with both LLM knowledge and local image features, achieving superior performance on 11 datasets for few-shot recognition and base-to-new generalization.

Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visual concepts from massive training data and show strong generalization ability. A number of prompt learning methods have been proposed to efficiently adapt VLMs to downstream tasks with only a few training samples. We introduce a novel method, called Dual-Aligned Prompt Tuning (DuAl-PT), that improves the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs). Learnable prompts, as in CoOp, model the context implicitly through end-to-end training, which makes them difficult to control and interpret. Explicit context descriptions generated by LLMs such as GPT-3 can be used directly for zero-shot classification, but such prompts rely heavily on the LLM and remain underexplored in few-shot domains. With DuAl-PT, we propose to learn more context-aware prompts that benefit from both explicit and implicit context modeling. To achieve this, we use a pre-trained LLM to generate context descriptions, and we encourage the prompts to learn from the LLM's knowledge through alignment, together with an alignment between the prompts and local image features. Empirically, DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization. We hope DuAl-PT can serve as a strong baseline. Code will be available.
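The abstract describes two alignment signals added to ordinary prompt-based classification: one between the learned prompts and LLM-generated descriptions, and one between the prompts and local image features. The sketch below shows how such a combined objective could be assembled. It is not the paper's formulation, which the abstract does not give. The cosine-distance alignment terms, the max over patches, the `temperature` default, and the weights `w_text` and `w_local` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def dual_alignment_loss(image_feat, local_feats, prompt_feats, desc_feats,
                        label, temperature=0.01, w_text=1.0, w_local=1.0):
    """Illustrative combined loss (assumed form, not taken from the paper).

    image_feat   : global image embedding
    local_feats  : list of local (patch) image embeddings
    prompt_feats : one text embedding per class from the learnable prompts
    desc_feats   : one text embedding per class from LLM descriptions
    label        : index of the ground-truth class
    """
    # CLIP-style classification: image vs. every class prompt.
    logits = [cosine(image_feat, p) / temperature for p in prompt_feats]
    cls_loss = cross_entropy(logits, label)

    # Assumed text-side alignment: pull each class prompt toward the
    # embedding of that class's LLM-generated description.
    text_align = sum(1.0 - cosine(p, d)
                     for p, d in zip(prompt_feats, desc_feats)) / len(prompt_feats)

    # Assumed vision-side alignment: the true-class prompt should match
    # at least one local image feature well.
    local_align = 1.0 - max(cosine(patch, prompt_feats[label])
                            for patch in local_feats)

    return cls_loss + w_text * text_align + w_local * local_align


# Toy 2-D embeddings for two classes.
prompts = [[1.0, 0.0], [0.0, 1.0]]
descriptions = [[1.0, 0.0], [0.0, 1.0]]
image = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0]]

loss_correct = dual_alignment_loss(image, patches, prompts, descriptions, label=0)
loss_wrong = dual_alignment_loss(image, patches, prompts, descriptions, label=1)
```

In this toy setup the loss is lower when the label matches the class whose prompt agrees with the image. In practice the embeddings would come from the VLM's text and image encoders, and only the prompt parameters would be trained.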
