CV · Mar 10, 2024

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

arXiv:2403.06126v2 · 3 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the performance degradation of vision-language models on downstream tasks when test inputs follow a different distribution. It offers a test-time adaptation method that is incremental but effective.

The paper tackles the problem of adapting frozen vision-language models such as CLIP to novel test distributions. It proposes In-Context Prompt Learning (InCPL), which uses a few labeled examples as context to enable reliable label estimation for each test sample, and the authors report state-of-the-art results across various downstream datasets.

Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization capabilities across various downstream tasks. However, their performance degrades significantly when test inputs exhibit different distributions. In this paper, we explore the concept of test-time prompt tuning (TTPT), which facilitates the adaptation of the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks, which empowers a pre-trained vision-language model with labeled examples as context information on the downstream task. Specifically, InCPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, InCPL employs an efficient language-to-vision translator to explore the textual prior information for visual prompt learning. Further, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across different modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompt. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.
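
The one-step adaptation described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it uses random NumPy vectors in place of CLIP features, a single additive prompt vector in place of learned visual prompts, and it omits the language-to-vision translator and the cyclic visual/textual strategy entirely. The loss used here (entropy on the test sample plus cross-entropy on the labeled context examples) is an assumed stand-in for the paper's context-aware loss, and every name (`class_probs`, `incontext_loss`, `adapt_one_step`) is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_probs(feats, prompt, text_emb, scale=5.0):
    """Frozen toy 'model': add the prompt, L2-normalize, score against class embeddings."""
    f = feats + prompt
    f = f / np.linalg.norm(f, axis=-1, keepdims=True)
    return softmax(scale * f @ text_emb.T)

def incontext_loss(prompt, test_feat, ctx_feats, ctx_labels, text_emb):
    """Assumed loss: entropy on the unlabeled test sample + cross-entropy on context examples."""
    p_test = class_probs(test_feat[None, :], prompt, text_emb)[0]
    entropy = -np.sum(p_test * np.log(p_test + 1e-12))
    p_ctx = class_probs(ctx_feats, prompt, text_emb)
    picked = p_ctx[np.arange(len(ctx_labels)), ctx_labels]
    cross_entropy = -np.mean(np.log(picked + 1e-12))
    return entropy + cross_entropy

def numerical_grad(fn, x, eps=1e-5):
    """Central finite differences; adequate for a tiny toy prompt, not for real models."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

def adapt_one_step(test_feat, ctx_feats, ctx_labels, text_emb, lr=1e-3):
    """Take one gradient step on the prompt only; the 'model' itself stays frozen."""
    prompt = np.zeros_like(test_feat)
    loss_fn = lambda p: incontext_loss(p, test_feat, ctx_feats, ctx_labels, text_emb)
    loss_before = loss_fn(prompt)
    prompt = prompt - lr * numerical_grad(loss_fn, prompt)
    return prompt, loss_before, loss_fn(prompt)

# Synthetic stand-ins for frozen text and image features (assumptions, not CLIP outputs).
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
text_emb = rng.normal(size=(num_classes, dim))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
ctx_labels = np.arange(num_classes)                            # one labeled example per class
ctx_feats = text_emb + 0.3 * rng.normal(size=(num_classes, dim))
test_feat = text_emb[1] + 0.3 * rng.normal(size=dim) + 0.5     # distribution-shifted test input

prompt, loss_before, loss_after = adapt_one_step(test_feat, ctx_feats, ctx_labels, text_emb)
prediction = int(np.argmax(class_probs(test_feat[None, :], prompt, text_emb)[0]))
```

The design point the sketch tries to capture is that only `prompt` is updated while the scoring function is untouched, and that the labeled context examples give the otherwise unsupervised test-time objective a supervised anchor.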
