CV · Jan 7, 2025

Realistic Test-Time Adaptation of Vision-Language Models

arXiv:2501.03729v1 · 11 citations · h-index: 50 · Has Code · CVPR
Originality: Incremental advance
AI Analysis

This work addresses realistic deployment challenges for vision-language models, though the advance is incremental because it builds on existing test-time adaptation methods.

The paper tackles test-time adaptation for vision-language models under realistic conditions, such as a variable number of classes per batch and non-i.i.d. batches. It shows that current methods compromise zero-shot robustness, and it introduces StatA, a method that preserves text-encoder knowledge through a novel regularization term.

The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at https://github.com/MaxZanella/StatA.
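Scenario (i) in the abstract, a variable number of effective classes within a single batch, can be illustrated with a small sampling sketch. This is not the authors' evaluation code: the function name `sample_batch`, its parameters, and the synthetic labels are all hypothetical, and the only assumption is that a batch is drawn from a randomly chosen subset of the label set.

```python
import random


def sample_batch(labels, batch_size, num_effective_classes, rng):
    """Return sample indices for a batch that covers only a subset of classes.

    Hypothetical helper for illustration; not taken from the StatA repository.
    """
    classes = sorted(set(labels))
    # Choose which classes are actually present in this batch.
    present = set(rng.sample(classes, num_effective_classes))
    # Restrict the candidate pool to samples from the present classes.
    candidates = [i for i, y in enumerate(labels) if y in present]
    # Sample without replacement when the pool is large enough.
    if len(candidates) >= batch_size:
        return rng.sample(candidates, batch_size)
    return rng.choices(candidates, k=batch_size)


rng = random.Random(0)
# Synthetic labels standing in for a 100-class test set (assumption).
labels = [rng.randrange(100) for _ in range(10_000)]
batch_idx = sample_batch(labels, batch_size=64, num_effective_classes=5, rng=rng)
effective_classes = {labels[i] for i in batch_idx}
```

Varying `num_effective_classes` from one up to the full label set gives a range of batches on which an adaptation method can be compared against the zero-shot baseline, which is the kind of comparison the abstract describes.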

Code Implementations: 1 repo
