MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
This work addresses efficiency issues in prompt tuning for vision-language models and offers a practical solution for researchers and practitioners, though it appears incremental because it builds on existing prompt tuning methods.
The paper tackles the increased complexity and training cost of CLIP-based prompt tuning for vision-language models. It proposes Model-Agnostic Optimization (MAO), a plug-and-play method that improves performance while keeping computational cost low, a claim the authors support with extensive experiments.
Although CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on restructuring the model architecture, e.g., by adding loss calculations and meta-networks. These approaches generally increase complexity and training cost. To keep the tuning process efficient, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any component of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and we incorporate an Alterable Regularization module to strengthen the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments demonstrate the strong performance and efficiency of MAO. The code of MAO is available at: https://github.com/JREion/M.A.O