CV · Jul 29, 2024

Advancing Prompt Learning through an External Layer

Fangming Cui, Xun Yang, Chao Wu, Liang Xiao, Xinmei Tian

arXiv:2407.19674v65.25 citations · h-index: 23

Originality: Incremental advance

AI Analysis

This work addresses generalization issues in prompt learning for vision-language models, offering an incremental improvement for tasks like few-shot learning and domain shifts.

The paper tackles the poor generalization of prompt learning in vision-language models by introducing an external layer with learnable visual embeddings and a two-pronged alignment approach, achieving superior performance in four experiments across 15 datasets compared to existing methods.

Prompt learning is a promising method for adapting pre-trained vision-language models (VLMs) to various downstream tasks by learning a set of text embeddings. One challenge inherent to these methods is poor generalization, because the learned text embeddings are not valid for unseen tasks. A straightforward way to bridge this gap is to freeze the text embeddings in prompts, but this leaves the model without the capacity to adapt VLMs to downstream tasks. To address this dilemma, we propose a paradigm called EnPrompt with a novel External Layer (EnLa). Specifically, we propose a textual external layer and learnable visual embeddings for adapting VLMs to downstream tasks. The learnable external layer is built upon valid embeddings of pre-trained CLIP. This design balances the learning capabilities of the two branches. To align the textual and visual features, we propose a novel two-pronged approach: i) we introduce optimal transport as the discrepancy metric to align the vision and text modalities, and ii) we introduce a novel strengthening feature to enhance the interaction between these two modalities. Four representative experiments (i.e., base-to-novel generalization, few-shot learning, cross-dataset generalization, and domain-shift generalization) across 15 datasets demonstrate that our method outperforms existing prompt learning methods.
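
The abstract names optimal transport as the discrepancy metric between the vision and text modalities but does not give the formulation. As a rough illustration only, the sketch below computes an entropy-regularized optimal transport cost with Sinkhorn iterations. The cosine cost, the uniform marginals, the regularization strength `eps`, and the function names `cosine_cost` and `sinkhorn` are all assumptions made for this example, not details taken from the paper.

```python
import math

def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cos(xs[i], ys[j]) (assumed cost, not from the paper)."""
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]
    xs_u = [unit(x) for x in xs]
    ys_u = [unit(y) for y in ys]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in ys_u] for x in xs_u]

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT with uniform marginals (an assumption).

    Returns (transport_plan, transport_cost).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n                      # mass on each visual feature
    b = [1.0 / m] * m                      # mass on each text feature
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total

# Toy features (hypothetical): two visual vectors and two text vectors.
visual = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_blurred = [[1.0, 1.0], [1.0, 1.0]]

_, d_aligned = sinkhorn(cosine_cost(visual, text_aligned))
_, d_blurred = sinkhorn(cosine_cost(visual, text_blurred))
print(d_aligned < d_blurred)
```

In a training setup, a quantity like `total` could serve as an alignment loss that shrinks as the two feature sets line up; how the paper actually combines it with its other objectives is not stated in the abstract.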
