CV, Apr 29, 2023

Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT

Zhenxiang Xiao, Yuzhong Chen, Lu Zhang, Junjie Yao, Zihao Wu, Xiaowei Yu, Yi Pan, Lin Zhao, Chong Ma, Xinyu Liu, Wei Liu, Xiang Li

arXiv:2305.00201v1, 11.620 citations, h-index: 154

Originality: Incremental advance

AI Analysis

This work targets researchers and practitioners who want to improve visual classification models, but it appears incremental because it applies known prompt techniques to a new modality.

The paper adapts prompt design from instruction tuning to visual transformers for image classification, introduces Instruction-ViT with multi-modal prompts, and reports improved performance and domain adaptability in experiments on image captioning tasks.

Prompts have been shown to play a crucial role in large language models, and in recent years vision models have also used prompts to improve scalability across multiple downstream tasks. In this paper, we focus on adapting prompt design based on instruction tuning to a visual transformer model for image classification, which we call Instruction-ViT. The key idea is to use multi-modal prompts (text or image prompts) related to category information to guide the fine-tuning of the model. In experiments on several image captioning tasks, performance and domain adaptability improved. Our work provides an innovative strategy for fusing multi-modal prompts that yields better performance and faster adaptability for visual classification models.
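To make the key idea concrete, the following is a minimal sketch of how category-related prompt tokens might be attached to a ViT input sequence. It is not the authors' implementation. Everything here is an assumption for illustration: the helper names (`make_embedding`, `build_input_sequence`, `classify_by_prompt_similarity`), the random vectors standing in for encoder outputs, and the choice to classify by cosine similarity between the [CLS] output and the prompt outputs, which the abstract does not specify. The transformer encoder itself is omitted.

```python
import random


def make_embedding(dim, rng):
    """Hypothetical stand-in for an encoder output (text or image prompt, or image patch)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def build_input_sequence(cls_token, patch_tokens, prompt_tokens):
    """Concatenate [CLS], the image patch tokens, and one prompt token per category."""
    return [cls_token] + list(patch_tokens) + list(prompt_tokens)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def classify_by_prompt_similarity(cls_output, prompt_outputs):
    """Assumed scoring rule: pick the category whose prompt output best matches [CLS]."""
    scores = [cosine(cls_output, p) for p in prompt_outputs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


rng = random.Random(0)
dim, num_patches, num_categories = 8, 4, 3

cls_token = make_embedding(dim, rng)
patch_tokens = [make_embedding(dim, rng) for _ in range(num_patches)]
# One prompt embedding per category; in practice these would come from a
# text or image encoder applied to category-related prompts (assumption).
prompt_tokens = [make_embedding(dim, rng) for _ in range(num_categories)]

sequence = build_input_sequence(cls_token, patch_tokens, prompt_tokens)

# The encoder is omitted. To exercise the scoring rule, pretend the [CLS]
# output coincides with the output of the prompt for category 1.
prompt_outputs = prompt_tokens
cls_output = list(prompt_outputs[1])
predicted, scores = classify_by_prompt_similarity(cls_output, prompt_outputs)
```

In this sketch the sequence fed to the encoder has one [CLS] token, the patch tokens, and one prompt token per category. A real model would pass `sequence` through transformer layers before scoring, so that the prompts can influence the image representation during fine-tuning.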
