CV, CL, LG | Mar 16, 2023

Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models

arXiv:2303.09100v2 | 12 citations | h-index: 46
AI Analysis

This work addresses the challenge of creating effective prompts for downstream applications in vision-language models, offering a novel method that improves generalization over existing manual or point-estimation techniques.

The paper tackles prompt tuning for vision-language models with a Bayesian probabilistic approach that generates stochastic prompts hierarchically and aligns them with visual patches. It reports promising transferability and generalization across 15 datasets on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.

For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing works on prompt engineering, which either require laborious manual designs or optimize the prompt tuning as a point estimation problem, may fail to describe diverse characteristics of categories and limit their applications. We introduce a Bayesian probabilistic resolution to prompt tuning, where the label-specific stochastic prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model. Importantly, we semantically regularize the tuning process by minimizing the statistical distance between the visual patches and linguistic prompts, which pushes the stochastic label representations to faithfully capture diverse visual concepts, instead of overfitting the training categories. We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts. Extensive results over 15 datasets show promising transferability and generalization performance of our proposed model, both quantitatively and qualitatively.
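
The two mechanisms named in the abstract, hierarchical prompt generation and patch-prompt alignment, can be illustrated with a small NumPy sketch. This is not the authors' implementation. Every name and shape here (`sample_prompts`, `sinkhorn_distance`, the one-layer `tanh` generator, the 49 random "patches") is hypothetical, and entropic optimal transport (Sinkhorn iterations) is swapped in as a generic stand-in for the "statistical distance", which the abstract does not specify.

```python
import numpy as np

def sample_prompts(mu, log_sigma, W, b, n_samples, rng):
    """Hierarchical generation: draw z ~ N(mu, sigma^2), then map z to a prompt embedding.

    The one-layer tanh generator is an assumption standing in for the
    "lightweight generative model" mentioned in the abstract.
    """
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(log_sigma) * eps      # reparameterized latent samples
    return np.tanh(z @ W + b)             # shape: (n_samples, embed_dim)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sinkhorn_distance(patches, prompts, reg=0.1, n_iters=200):
    """Entropic OT cost between uniform distributions over two embedding sets.

    Assumed stand-in for the patch-prompt statistical distance.
    """
    cost = 1.0 - l2_normalize(patches) @ l2_normalize(prompts).T  # cosine cost in [0, 2]
    K = np.exp(-cost / reg)
    a = np.full(patches.shape[0], 1.0 / patches.shape[0])   # uniform mass on patches
    b = np.full(prompts.shape[0], 1.0 / prompts.shape[0])   # uniform mass on prompts
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]    # transport plan between patches and prompts
    return float((plan * cost).sum())

# Toy usage with random data (no real image or text encoder involved).
rng = np.random.default_rng(0)
latent_dim, embed_dim = 8, 16
mu, log_sigma = np.zeros(latent_dim), np.zeros(latent_dim)
W = 0.5 * rng.standard_normal((latent_dim, embed_dim))
b = np.zeros(embed_dim)

prompts = sample_prompts(mu, log_sigma, W, b, n_samples=4, rng=rng)  # stochastic prompts for one label
patches = rng.standard_normal((49, embed_dim))                        # e.g. a 7x7 grid of patch features
dist = sinkhorn_distance(patches, prompts)
```

In a training loop, a distance of this kind would be added to the classification loss as a regularizer, so that the sampled label prompts stay close to the observed patch features rather than fitting only the training categories.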

