CV · Nov 12, 2025

Ultra-Light Test-Time Adaptation for Vision-Language Models

arXiv:2511.09101v1 · 2 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of adapting VLMs to streaming and edge scenarios under domain shift. It offers a lightweight method with state-of-the-art performance, though the contribution is incremental because it builds on existing test-time adaptation methods.

The paper tackles domain shift in Vision-Language Models (VLMs) such as CLIP, which causes feature drift and miscalibration. It proposes Ultra-Light Test-Time Adaptation (UL-TTA), a training-free method that adapts only logit-level parameters. The reported results are an average gain of +4.7 points in top-1 accuracy over zero-shot CLIP and a 20-30% reduction in expected calibration error (ECE), with less than 8% latency overhead.
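For readers unfamiliar with the calibration metric, the following is a minimal sketch of the standard equal-width binned ECE: the sample-weighted average, over confidence bins, of the gap between mean confidence and accuracy. The function name, the choice of 10 bins, and the half-open binning are assumptions made for illustration; the paper's exact evaluation protocol is not given here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: max predicted probability per sample, each in [0, 1]
    correct:     True/False per sample (was the top-1 prediction right?)
    n_bins:      number of equal-width bins (10 is an assumed default)
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    # Per bin: [sample count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1.0 if ok else 0.0
    ece = 0.0
    for count, conf_sum, correct_sum in bins:
        if count:
            ece += (count / n) * abs(correct_sum / count - conf_sum / count)
    return ece
```

A model that predicts with 0.9 confidence but is right only 75% of the time has an ECE of about 0.15 under this definition; lower values indicate better calibration.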

Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
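The abstract describes the procedure only at a high level, so the following is a rough sketch of what a logit-level, backprop-free update of this kind could look like, not the authors' implementation. It covers confidence filtering, a text-anchored prototype update, a Dirichlet-smoothed prior update, and norm clipping. The class name `LogitLevelAdapter`, all hyperparameter names and default values (`kappa`, `alpha`, `temp`, `conf_thresh`, `max_shift`), and the exact update formulas are assumptions. The decoupled calibration temperature and the prior KL constraint are omitted for brevity.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

class LogitLevelAdapter:
    """Hypothetical sketch: adapts prototypes and priors only; the backbone is untouched."""

    def __init__(self, text_protos, kappa=10.0, alpha=10.0,
                 temp=0.01, conf_thresh=0.6, max_shift=5.0):
        self.text = [normalize(p) for p in text_protos]  # fixed text anchors
        k, d = len(self.text), len(self.text[0])
        self.kappa = kappa              # assumed strength of the text anchor
        self.alpha = alpha              # assumed Dirichlet pseudo-count per class
        self.temp = temp                # assumed prediction temperature
        self.conf_thresh = conf_thresh  # assumed confidence filter threshold
        self.max_shift = max_shift      # assumed norm clip on accumulated features
        self.feat_sum = [[0.0] * d for _ in range(k)]  # soft feature sums per class
        self.counts = [0.0] * k                        # soft sample counts per class

    def prototypes(self):
        """Anchored mean direction: text anchor plus accumulated image features."""
        return [normalize([self.kappa * t + s for t, s in zip(tp, fs)])
                for tp, fs in zip(self.text, self.feat_sum)]

    def priors(self):
        """Mean of a Dirichlet posterior with symmetric pseudo-counts alpha."""
        total = sum(self.counts) + self.alpha * len(self.counts)
        return [(self.alpha + c) / total for c in self.counts]

    def predict_proba(self, feat):
        x = normalize(feat)
        logits = [sum(a * b for a, b in zip(x, p)) / self.temp + math.log(pi)
                  for p, pi in zip(self.prototypes(), self.priors())]
        return softmax(logits)

    def step(self, feat):
        """Predict, then update the statistics only if the prediction is confident."""
        probs = self.predict_proba(feat)               # E-step: soft responsibilities
        if max(probs) >= self.conf_thresh:             # selective sample filtering
            x = normalize(feat)
            for k, r in enumerate(probs):              # M-step: closed-form updates
                self.counts[k] += r
                s = [a + r * b for a, b in zip(self.feat_sum[k], x)]
                norm = math.sqrt(sum(v * v for v in s))
                if norm > self.max_shift:              # guard: clip drift from the anchor
                    s = [v * self.max_shift / norm for v in s]
                self.feat_sum[k] = s
        return probs
```

In this sketch, calling `step` once per incoming image feature returns the class probabilities and updates the state in constant memory per class, which matches the streaming setting described above. The clip on `feat_sum` bounds how far a prototype can move away from its text anchor, so a long run of confident but wrong predictions cannot fully overwrite it.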
