MINT: Memory-Infused Prompt Tuning at Test-time for CLIP
MINT addresses the challenge of adapting pre-trained models to new data distributions at test time without retraining. The contribution is incremental but practical for real-world applications.
The paper tackles the problem of improving generalization for Vision-Language Pre-trained Models under test-time data distribution shifts by proposing MINT, a framework that uses a Memory Prompt Bank to dynamically adapt prompts, achieving state-of-the-art results on benchmarks like ImageNet-R and DomainNet.
Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. Existing Test-Time Adaptation (TTA) methods fall short of fully leveraging the model's internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that serve as a memory of previously seen samples. At test time, relevant prompt pairs in the MPB are retrieved using the hierarchical visual features of test images and dynamically assembled into Associative Prompts. The Associative Prompts are then injected into the image encoder to provide fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. By leveraging this MPB-acquired memory, MINT enables rapid, precise VLM adaptation at test time without source data or retraining. The code is available at https://github.com/Jamieyi2004/MINT.
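The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity matching, the top-k selection, the softmax weighting, and all names (`assemble_associative_prompt`, `top_k`, the array shapes) are assumptions chosen for illustration. It shows a single feature level only; a hierarchical variant would presumably repeat the same lookup with features from several encoder layers.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def assemble_associative_prompt(query, keys, values, top_k=2):
    """Retrieve and blend prompt values from a key-value bank (hypothetical).

    query:  (d,)        visual feature of one test image
    keys:   (n, d)      one key per stored prompt pair
    values: (n, p, d_v) one prompt (p tokens of width d_v) per key
    Returns the blended prompt (p, d_v), the selected indices, and weights.
    """
    # Assumed matching rule: cosine similarity between the query and each key.
    sims = l2_normalize(keys) @ l2_normalize(query)      # shape (n,)
    # Keep the top_k most similar entries, best match first.
    top = np.argsort(-sims)[:top_k]
    # Assumed blending rule: softmax over the selected similarities.
    w = np.exp(sims[top] - sims[top].max())
    w = w / w.sum()
    # Weighted sum of the selected prompt values.
    prompt = np.tensordot(w, values[top], axes=1)        # shape (p, d_v)
    return prompt, top, w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(8, 16))        # 8 stored pairs, 16-dim keys
    values = rng.normal(size=(8, 4, 32))   # each value: 4 tokens, 32-dim
    query = keys[3]                        # a query identical to key 3
    prompt, top, w = assemble_associative_prompt(query, keys, values)
    print(prompt.shape, top, w)
```

In this sketch the returned `prompt` would be the tensor injected into the image encoder as extra tokens. In MINT the keys and values are learnable, so an actual implementation would need a differentiable framework rather than plain NumPy.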