Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift
This work addresses few-shot learning under extreme domain shift in a deployment-oriented setting where all classes must be handled at once. It represents an incremental improvement over existing methods.
The paper tackles the problem of adapting foundation vision-language models such as CLIP to datasets with extreme distribution shifts using only a few labeled examples. It proposes MIST (Multiple Stochastic Prompt Tuning), which handles all classes simultaneously, and demonstrates its effectiveness in experiments against state-of-the-art methods.
Foundation Vision-Language Models (VLMs) like CLIP exhibit strong generalization capabilities due to large-scale pretraining on diverse image-text pairs. However, their performance often degrades when applied to target datasets with significant distribution shifts in both visual appearance and class semantics. Recent few-shot learning approaches adapt CLIP to downstream tasks using limited labeled data via adapter or prompt tuning, but are not specifically designed to handle such extreme domain shifts. Conversely, some works addressing cross-domain few-shot learning consider such domain-shifted scenarios but operate in an episodic setting with only a few classes per episode, limiting their applicability to real-world deployment, where all classes must be handled simultaneously. To address this gap, we propose a novel framework, MIST (Multiple Stochastic Prompt Tuning), for efficiently adapting CLIP to datasets with extreme distribution shifts using only a few labeled examples, in scenarios involving all classes at once. Specifically, we introduce multiple learnable prompts per class to effectively capture diverse modes in visual representations arising from distribution shifts. To further enhance generalization, these prompts are modeled as learnable Gaussian distributions, enabling efficient exploration of the prompt parameter space and reducing overfitting caused by limited supervision. Extensive experiments and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed framework.
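The two ideas in the abstract, several prompts per class and prompts modeled as learnable Gaussians, can be illustrated with a small sketch. This is not the authors' implementation, and it rests on several assumptions. Each sampled prompt is used directly as a text-side feature vector, whereas the real method would pass prompt tokens through CLIP's text encoder. A class is scored by its best-matching prompt (a max over prompts), which the abstract does not specify. The names sample_prompt, class_scores, n_samples and scale are hypothetical. Sampling uses the standard reparameterization: mean plus standard deviation times unit Gaussian noise.

```python
import math
import random


def sample_prompt(mean, log_std, rng):
    """Reparameterized draw: mean + exp(log_std) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def class_scores(image_feat, prompt_means, prompt_log_stds, rng, n_samples=8):
    """Score every class at once.

    prompt_means[c][k] is the mean vector of prompt k for class c, and
    prompt_log_stds[c][k] is the matching log standard deviation.
    Assumption: each prompt's similarity is averaged over n_samples draws,
    and the class score is the maximum over that class's prompts.
    """
    scores = []
    for means_c, log_stds_c in zip(prompt_means, prompt_log_stds):
        per_prompt = []
        for mean, log_std in zip(means_c, log_stds_c):
            sims = [
                cosine(image_feat, sample_prompt(mean, log_std, rng))
                for _ in range(n_samples)
            ]
            per_prompt.append(sum(sims) / n_samples)
        scores.append(max(per_prompt))
    return scores


def softmax(scores, scale=100.0):
    """Turn similarity scores into probabilities (scale is an assumed logit scale)."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: 2 classes, 2 prompts per class, 3-dimensional features.
rng = random.Random(0)
prompt_means = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # class 0: two modes
    [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]],  # class 1: two modes
]
prompt_log_stds = [[[-5.0] * 3, [-5.0] * 3], [[-5.0] * 3, [-5.0] * 3]]
image_feat = [0.1, 0.0, 0.9]  # lies close to the first mode of class 1

scores = class_scores(image_feat, prompt_means, prompt_log_stds, rng)
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In a trainable version, the means and log standard deviations would be the learned parameters, and the reparameterized draw is what would let gradients reach both. Keeping several means per class lets one class cover visually distinct modes, while the sampled noise acts as a regularizer when only a few labeled examples are available.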