CV · Sep 25, 2023

Tuning Multi-mode Token-level Prompt Alignment across Modalities

arXiv:2309.13847v2 · 44 citations · h-index: 61
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific limitation in prompt tuning for vision-language models, offering an incremental improvement over prior methods.

The paper tackles the problem of sub-optimal prompt discovery in vision-language models by proposing a multi-mode token-level tuning framework that uses optimal transportation to align prompt tokens across modalities, resulting in superior generalization and few-shot abilities on image recognition benchmarks.

Advancements in prompt tuning of vision-language models have underscored their potential in enhancing open-world visual concept comprehension. However, prior works primarily focus on single-mode (only one prompt for each modality) and holistic-level (image or sentence) semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery. To address this limitation, we propose a multi-mode token-level tuning framework that leverages optimal transport to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transport problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens can capture diverse visual concepts.
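The hierarchical transport idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The cosine cost, the uniform marginals, the Sinkhorn solver with entropic regularization `eps`, and all function names are assumptions chosen for illustration. The inner problem aligns the tokens of one visual prompt with the tokens of one textual prompt. The outer problem then aligns the two sets of prompts, using the inner distances as its cost matrix.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized transport plan between histograms a and b (assumed solver)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def token_ot_distance(x_tokens, y_tokens, eps=0.1):
    """Token-level transport cost between two prompts, each an (n_tokens, d) array."""
    cost = cosine_cost(x_tokens, y_tokens)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])   # uniform token weights (assumption)
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum())

def hierarchical_ot_distance(visual_prompts, text_prompts, eps=0.1):
    """Prompt-level transport whose cost matrix holds token-level transport distances."""
    cost = np.array([[token_ot_distance(v, t, eps) for t in text_prompts]
                     for v in visual_prompts])
    a = np.full(len(visual_prompts), 1.0 / len(visual_prompts))
    b = np.full(len(text_prompts), 1.0 / len(text_prompts))
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum())

# Toy usage: two visual prompts and two textual prompts with random token embeddings.
rng = np.random.default_rng(0)
visual = [rng.normal(size=(4, 32)) for _ in range(2)]
text = [rng.normal(size=(6, 32)) for _ in range(2)]
distance = hierarchical_ot_distance(visual, text)
```

A smaller distance corresponds to a higher similarity between the two modality-specific sets. In a tuning setup this quantity would be computed per class and fed into a classification loss, but that step is outside this sketch.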
