CV | Apr 16, 2024

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Enming Zhang, Bingke Zhu, Yingying Chen, Qinghai Miao, Ming Tang, Jinqiao Wang

arXiv:2404.10357v32.01 citations | h-index: 26 | IEEE Transactions on Multimedia

Originality: Incremental advance

AI Analysis

The paper addresses a bottleneck in adapting VLMs to downstream tasks, though it appears to be an incremental enhancement of existing prompt tuning methods.

The paper tackles the limitation of prompt template diversity in Vision-Language Models (VLMs) like CLIP, which restricts adaptation to downstream tasks and can cause incorrect predictions. The proposed CoKnow framework enhances prompt learning with multi-knowledge representation, outperforming previous methods on 11 datasets.

Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which are capable of generating Multi-Knowledge Representation for an input image without requiring additional priors. We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.
