CV | Feb 19, 2025

Modular Prompt Learning Improves Vision-Language Models

Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao

arXiv:2502.14125v16.22 citations | h-index: 13 | Has Code | ICASSP

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in prompt learning for vision-language models, offering incremental improvements for researchers and practitioners in efficient model adaptation.

The paper tackles the problem of information loss in deep prompt learning for vision-language models by proposing Modular Prompt Learning (MPL), which improves performance by preserving prompt information across transformer layers, achieving an average 0.7% gain on base-to-new generalization across 11 datasets.

Pre-trained vision-language models can interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potential of pre-trained models and readily adapts them to new scenarios. Compared to fine-tuning, prompt learning enables the model to achieve comparable or better performance with fewer trainable parameters. In addition, prompt learning freezes the pre-trained model and thereby avoids the catastrophic forgetting that can occur in fine-tuning. Continuous prompts inserted into the input of every transformer layer (i.e., deep prompts) can improve the performance of pre-trained models on downstream tasks. For the $i$-th transformer layer, the inserted prompts replace the prompts previously inserted in the $(i-1)$-th layer. Although the self-attention mechanism contextualizes the newly inserted prompts for the current layer together with the embeddings from the previous layer's output, removing all inserted prompts from the previous layer inevitably loses information contained in the continuous prompts. In this work, we propose Modular Prompt Learning (MPL), which is designed to promote the preservation of information contained in the inserted prompts. We evaluate the proposed method on base-to-new generalization and cross-dataset tasks. Averaged over 11 datasets, our method achieves a 0.7% performance gain on the base-to-new generalization task compared to the state-of-the-art method. The largest improvement on an individual dataset is 10.7% (on the EuroSAT dataset).
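The replacement step that the abstract identifies as the source of information loss can be illustrated with a small sketch. This is not the paper's MPL method and not code from its repository; it only mimics the baseline deep-prompting bookkeeping described above. The names `toy_layer` and `deep_prompt_forward` are hypothetical, and the mean-mixing layer is an assumed stand-in for self-attention so that the example runs without any dependencies.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]


def toy_layer(tokens: Tokens) -> Tokens:
    """Assumed stand-in for a transformer layer (NOT real self-attention).

    Each token is averaged with the sequence mean, so information can flow
    between prompt tokens and input tokens.
    """
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def deep_prompt_forward(
    inputs: Tokens,
    layer_prompts: List[Tokens],
    layer: Callable[[Tokens], Tokens] = toy_layer,
) -> Tokens:
    """Baseline deep prompting: every layer gets its own prompts.

    layer_prompts[i] holds the prompts for layer i; all layers are assumed
    to use the same number of prompt tokens.
    """
    n_prompts = len(layer_prompts[0])
    # First layer: prepend its prompts to the input tokens.
    hidden = layer(layer_prompts[0] + inputs)
    for prompts in layer_prompts[1:]:
        # Drop the outputs at the previous layer's prompt positions and
        # insert the current layer's prompts in their place.
        hidden = layer(prompts + hidden[n_prompts:])
    return hidden


if __name__ == "__main__":
    inputs = [[1.0, 0.0], [0.0, 1.0]]
    layer_prompts = [[[2.0, 2.0]], [[-2.0, -2.0]]]  # one prompt per layer
    out = deep_prompt_forward(inputs, layer_prompts)
    print(len(out), "tokens after two layers")
```

In this sketch, whatever the first layer computed at its prompt position is thrown away before the second layer runs; the first layer's prompt can influence later layers only through what it mixed into the input tokens. That discarded slot is the loss the abstract refers to, and it is what MPL is designed to mitigate.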
