LG | AI | CL | CV | Jul 9, 2025

Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation

arXiv:2507.07147v1 | 1 citation | h-index: 1 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in prompt learning for Vision Language Models, offering a more robust way to adapt to downstream tasks without annotated data, though it is an incremental improvement over existing techniques.

The paper tackles the high variability and low reliability of existing methods that extract text-based descriptions from Large Language Models for prompt learning in Vision Language Models. It proposes a description-free multi-prompt learning method that distills LLM knowledge directly into the prompts, and reports superior performance across 11 recognition datasets.

Recent advances in pre-trained Vision Language Models (VLMs) have shown promising potential for effectively adapting to downstream tasks through prompt learning, without the need for additional annotated paired datasets. To supplement the text information in VLMs trained on correlations with vision data, new approaches leveraging Large Language Models (LLMs) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods typically extract text-based responses (i.e., descriptions) from LLMs to incorporate into prompts; however, this approach suffers from high variability and low reliability. In this work, we propose Description-free Multi-prompt Learning (DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLMs into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics while still being represented as continuous vectors for optimization, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we empirically demonstrate the potential of prompt weighting in reflecting the importance of different prompts during training. Experimental results show that our approach achieves superior performance across 11 recognition datasets.
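
To make the idea of prompt weighting in a multi-prompt setting more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `weighted_multi_prompt_logits`, the softmax normalization of the weights, and the use of a weighted sum of cosine similarities as the class score are all assumptions chosen for illustration. The abstract only states that prompts are continuous vectors and that weights reflect the importance of different prompts.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def weighted_multi_prompt_logits(image_feat, prompt_feats, raw_weights):
    """Score each class as a weighted sum of image-prompt similarities.

    prompt_feats[c][m] is the (hypothetical) embedding of the m-th learned
    prompt for class c; raw_weights[c][m] is its unnormalized importance.
    Normalizing the weights with a softmax is an assumption of this sketch.
    """
    logits = []
    for class_prompts, class_raw in zip(prompt_feats, raw_weights):
        weights = softmax(class_raw)
        score = sum(w * cosine(image_feat, p)
                    for w, p in zip(weights, class_prompts))
        logits.append(score)
    return logits


# Toy data: one 2-D image feature, two classes, two prompts per class.
image_feat = [1.0, 0.0]
prompt_feats = [
    [[1.0, 0.0], [0.0, 1.0]],   # class 0 prompts
    [[0.0, 1.0], [-1.0, 0.0]],  # class 1 prompts
]
raw_weights = [[0.0, 0.0], [0.0, 0.0]]  # equal importance for every prompt

logits = weighted_multi_prompt_logits(image_feat, prompt_feats, raw_weights)
predicted = max(range(len(logits)), key=logits.__getitem__)
```

In a trainable version, both the prompt embeddings and the raw weights would be parameters updated by gradient descent, so a prompt that helps classification can receive a larger weight. Raising the first raw weight of class 0 in the toy data, for instance, moves that class score closer to the similarity of its best-matching prompt.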

