AI | CV | LG | Feb 17, 2025

Learning Generalizable Prompt for CLIP with Class Similarity Knowledge

arXiv:2502.11969v1 | 2 citations | h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses a generalization issue in prompt tuning for vision-language models, offering an incremental improvement for downstream task adaptation.

Learned prompts in vision-language models tend to overfit to the classes seen during tuning and generalize poorly to unseen classes. The paper proposes Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts. The authors report that applying SAR to existing prompt tuning methods improves generalization to unseen classes.

In vision-language models (VLMs), prompt tuning has shown its effectiveness in adapting models to downstream tasks. However, learned prompts struggle to generalize to unseen classes, as they tend to overfit to the classes that are targeted during prompt tuning. Examining failure cases, we observed that learned prompts disrupt the semantics of unseen classes, generating text embeddings with incorrect semantic relationships among classes. To address this, we propose Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts. Specifically, we first obtain novel classes related to base classes using ChatGPT-4o and utilize them as potential unseen classes during prompt tuning. Then, by targeting both base and novel classes, SAR aligns the similarity relationships among text embeddings generated by learnable prompts with the similarity relationships from hand-crafted prompts. Extensive experiments applying SAR to existing prompt tuning methods demonstrate its effectiveness in improving generalization to unseen classes.
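
The alignment idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the regularizer, so everything below is an assumption made for illustration: the function names, the use of cosine similarity, the temperature-scaled softmax over each class's similarities to the other classes, and the KL divergence as the alignment measure. The inputs stand in for text embeddings of the base and novel class names, produced once with hand-crafted prompts and once with learnable prompts.

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between rows of emb, shape (n_classes, dim)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

def similarity_distribution(emb, temperature):
    """For each class, a softmax distribution over its similarity to every OTHER class."""
    n = emb.shape[0]
    logits = cosine_similarity_matrix(emb) / temperature
    # Exclude self-similarity so each row describes relationships to other classes only.
    logits = np.where(np.eye(n, dtype=bool), -np.inf, logits)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def similarity_alignment_loss(learned_emb, handcrafted_emb, temperature=0.1):
    """Hypothetical regularizer: mean KL(hand-crafted || learned) over classes.

    This is an illustrative stand-in, not the loss defined in the paper.
    """
    n = learned_emb.shape[0]
    eps = 1e-12  # avoids log(0) if a probability underflows
    target = similarity_distribution(handcrafted_emb, temperature)
    pred = similarity_distribution(learned_emb, temperature)
    off_diag = ~np.eye(n, dtype=bool)
    kl_terms = target[off_diag] * (
        np.log(target[off_diag] + eps) - np.log(pred[off_diag] + eps)
    )
    return float(kl_terms.sum() / n)
```

In a prompt tuning setup, a term like this would be added to the task loss with a weighting coefficient and computed over both base classes and the generated novel classes. The loss is zero when the learned prompts reproduce the hand-crafted similarity structure and grows as the two structures diverge.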
