CV · Jun 17, 2024

BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

arXiv:2406.11309v2 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in test-time adaptation for zero-shot vision-language models, offering an incremental improvement for researchers and practitioners in computer vision and multimodal AI.

The paper addresses the difficulty of choosing a learning rate for test-time prompt tuning of zero-shot vision-language models such as CLIP: without validation data, a poorly chosen rate can cause training to collapse. It proposes BaFTA, a backpropagation-free algorithm that estimates class centroids via online clustering and weights each prediction by its reliability, measured with Rényi Entropy, before aggregating. The authors report that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.

Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
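The two ingredients named in the abstract, online centroid estimation and Rényi Entropy-based aggregation, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the running-mean centroid update, the order `alpha=0.5`, and the `softmax(-entropy / temperature)` weighting are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def renyi_entropy(probs, alpha=0.5):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and not equal to 1")
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def aggregate(predictions, alpha=0.5, temperature=1.0):
    """Average probability vectors, weighting low-entropy ones more heavily.

    Assumed scheme: weights = softmax(-entropy / temperature).
    """
    entropies = [renyi_entropy(p, alpha) for p in predictions]
    scores = [math.exp(-h / temperature) for h in entropies]
    total = sum(scores)
    weights = [s / total for s in scores]
    n_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(n_classes)
    ]


def update_centroid(centroid, count, embedding):
    """Running-mean update of one class centroid (assumed form of online clustering)."""
    new_count = count + 1
    new_centroid = [c + (e - c) / new_count for c, e in zip(centroid, embedding)]
    return new_centroid, new_count


# Hypothetical predictions: one confident, one near-uniform.
confident = [0.9, 0.05, 0.05]
uncertain = [0.34, 0.33, 0.33]
combined = aggregate([confident, uncertain])
```

In this sketch the confident prediction has lower entropy and therefore receives the larger weight, so `combined` leans toward it more than a plain average would. No gradients are computed at any point, which is the sense in which such an approach is backpropagation-free.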

