CV · Sep 7, 2025

AttriPrompt: Dynamic Prompt Composition Learning for CLIP

Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang

arXiv:2509.05949v1 · 3.62 citations · h-index: 56 · MM

Originality: Incremental advance

AI Analysis

This work addresses fine-grained feature optimization and content-aware adaptation in vision-language models, offering incremental improvements for real-world applications.

The paper tackles two limitations of deep text prompting in CLIP, over-reliance on contrastive learning objectives and prompts that stay static across inputs, by proposing AttriPrompt, which uses visual features to compose prompts dynamically and achieves up to 7.37% improvement in the base-to-novel setting.

The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment and neglect fine-grained feature optimization; and (2) static prompts across all input categories, which prevent content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We design an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features then retrieve semantically similar prompts from a prompt pool, and the retrieved prompts are concatenated to the input of every layer in the text encoder. Leveraging the hierarchical visual information embedded in the prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism that applies explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, with up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
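The retrieval and regularization steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: mean pooling stands in for the clustering step, the prompt pool is modeled as key/prompt pairs matched by cosine similarity, and the regularizer is taken to be a mean absolute difference. All function names (`aggregate`, `retrieve_prompts`, `self_regularization`) are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def aggregate(features):
    """Mean-pool a group of visual feature vectors.

    Assumption: this stands in for the paper's clustering of per-layer features.
    """
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def retrieve_prompts(query, pool_keys, pool_prompts, top_k=2):
    """Return the top_k prompts whose keys are most similar to the query.

    Assumption: the pool stores one key vector per prompt.
    """
    ranked = sorted(
        range(len(pool_keys)),
        key=lambda i: cosine(query, pool_keys[i]),
        reverse=True,
    )
    return [pool_prompts[i] for i in ranked[:top_k]]


def self_regularization(prompted, frozen):
    """Mean absolute difference between prompted and non-prompted text features.

    Assumption: the abstract does not specify the form of the constraint.
    """
    return sum(abs(p - f) for p, f in zip(prompted, frozen)) / len(prompted)


# Toy data: two visual tokens from one encoder layer and a three-entry pool.
layer_tokens = [[1.0, 0.0], [0.8, 0.2]]
pool_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool_prompts = ["p_a", "p_b", "p_c"]

query = aggregate(layer_tokens)
selected = retrieve_prompts(query, pool_keys, pool_prompts, top_k=2)
reg = self_regularization([0.5, 1.0], [0.0, 1.0])
```

In this toy setup the pooled query points mostly along the first axis, so the prompts keyed by `[1.0, 0.0]` and `[1.0, 1.0]` are selected. In a real model the selected prompt vectors would be concatenated to the text encoder's input at the matching layer, and the regularization term would be added to the training loss.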
