CV, CL · Jun 20, 2023

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

arXiv:2306.11400v2 · 5 citations · h-index: 16 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the loss of text-image alignment that uni-modal prompt tuning causes in pre-trained vision-language models. It is aimed at researchers and practitioners working on few-shot vision recognition and out-of-domain generalization, and it represents an incremental improvement over existing prompt tuning methods.

The paper argues that uni-modal prompt tuning for vision-language models performs sub-optimally, and proposes MuDPT, a multi-modal approach that learns a transformative network for deep bi-directional prompt fusion. The authors report better recognition and generalization than state-of-the-art methods by a clear margin.

Prompt tuning methods such as CoOp have recently shown promising visual recognition and transfer learning ability on various downstream tasks, following the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since the uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot vision recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
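To make the idea of "deep hierarchical bi-directional prompt fusion" more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the transformative network can be approximated by a pair of linear maps per layer, one carrying visual prompts into the text space and one carrying text prompts into the visual space, each added residually. The class name `BiDirectionalPromptFusion`, the helper `build_deep_prompts`, the residual form, and the prompt length are all hypothetical; the widths 512 and 768 are only illustrative (they match CLIP ViT-B/16). In actual prompt tuning the prompts and maps would be trained against frozen encoders, which this sketch omits.

```python
import numpy as np


class BiDirectionalPromptFusion:
    """Hypothetical fusion block: each modality's prompts receive a
    linear projection of the other modality's prompts (residual add)."""

    def __init__(self, text_dim, vision_dim, rng):
        scale = 0.02  # small init so fusion starts as a mild perturbation
        # Assumed linear maps: vision -> text space and text -> vision space.
        self.w_v2t = rng.normal(0.0, scale, size=(vision_dim, text_dim))
        self.w_t2v = rng.normal(0.0, scale, size=(text_dim, vision_dim))

    def __call__(self, text_prompts, vision_prompts):
        # Both outputs are computed from the *unfused* inputs, so the
        # exchange is symmetric rather than sequential.
        fused_text = text_prompts + vision_prompts @ self.w_v2t
        fused_vision = vision_prompts + text_prompts @ self.w_t2v
        return fused_text, fused_vision


def build_deep_prompts(n_layers, n_ctx, text_dim, vision_dim, seed=0):
    """Create one (text, vision) prompt pair per encoder layer and fuse
    each pair with its own fusion block (the 'deep hierarchical' part)."""
    rng = np.random.default_rng(seed)
    fused = []
    for _ in range(n_layers):
        text_p = rng.normal(0.0, 0.02, size=(n_ctx, text_dim))
        vision_p = rng.normal(0.0, 0.02, size=(n_ctx, vision_dim))
        fusion = BiDirectionalPromptFusion(text_dim, vision_dim, rng)
        fused.append(fusion(text_p, vision_p))
    return fused


# Illustrative sizes only: 3 prompted layers, 4 context tokens each.
prompts = build_deep_prompts(n_layers=3, n_ctx=4, text_dim=512, vision_dim=768)
```

Each entry of `prompts` is a pair of fused prompt matrices that would be prepended to the token sequence at the corresponding layer of the text and image encoders. The point of the sketch is only that neither modality's prompts are learned in isolation: every fused prompt depends on both.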

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
