CV · Aug 19, 2023

DPL: Decoupled Prompt Learning for Vision-Language Models

arXiv:2308.10061v1 · 5 citations · h-index: 65
Originality: Incremental advance
AI Analysis

This paper addresses generalization issues in vision-language models on downstream tasks, but the advance is incremental because it builds on existing prompt learning methods.

The paper tackles the problem of prompt learning overfitting to seen categories in vision-language models by proposing Decoupled Prompt Learning (DPL), which reformulates self-attention into four separate sub-processes and reports state-of-the-art performance on three benchmarks covering 15 image recognition datasets while remaining parameter-efficient.

Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks. However, current methods tend to overfit to seen categories, thereby limiting their ability to generalize to unseen classes. In this paper, we propose a new method, Decoupled Prompt Learning (DPL), which reformulates the attention in prompt learning to alleviate this problem. Specifically, we theoretically investigate the collaborative process between prompts and instances (i.e., image patches/text tokens) by reformulating the original self-attention into four separate sub-processes. Through detailed analysis, we observe that certain sub-processes can be strengthened with approximation techniques to bolster robustness and generalizability. Furthermore, we introduce language-conditioned textual prompting based on decoupled attention to naturally preserve the generalization of text input. Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning. By combining the proposed techniques, our approach achieves state-of-the-art performance on three representative benchmarks encompassing 15 image recognition datasets, while remaining parameter-efficient. Moreover, DPL does not rely on any auxiliary regularization task or extra training data, further demonstrating its remarkable generalization ability.
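To make the "four separate sub-processes" concrete, the sketch below shows one plausible reading: when learnable prompts are concatenated with instance tokens, the self-attention map splits into four blocks according to whether the query and the key come from prompts or from instances. This is only an illustration of that block structure, not the paper's formulation or its approximation techniques. The function name `attention_blocks`, the block labels, the single-head setup, and the random projection matrices are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_blocks(prompts, instances, w_q, w_k):
    """Split single-head self-attention over [prompts; instances] into four blocks.

    Rows of the attention map are queries and columns are keys, so the block
    named "prompt_to_instance" holds prompt queries attending to instance keys.
    The block names are hypothetical labels, not terminology from the paper.
    """
    p = prompts.shape[0]
    x = np.concatenate([prompts, instances], axis=0)  # shape (p + n, d)
    q, k = x @ w_q, x @ w_k
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # shape (p + n, p + n)
    blocks = {
        "prompt_to_prompt": attn[:p, :p],
        "prompt_to_instance": attn[:p, p:],
        "instance_to_prompt": attn[p:, :p],
        "instance_to_instance": attn[p:, p:],
    }
    return blocks, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    prompts = rng.normal(size=(4, d))     # 4 learnable prompt tokens (assumed)
    instances = rng.normal(size=(6, d))   # 6 image-patch or text tokens (assumed)
    w_q = rng.normal(size=(d, d))
    w_k = rng.normal(size=(d, d))
    blocks, attn = attention_blocks(prompts, instances, w_q, w_k)
    for name, block in blocks.items():
        print(name, block.shape)
```

Because the softmax is taken over the full row, the four blocks are still coupled through the shared normalization; treating them separately, as the abstract describes, would require handling each block on its own terms.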

