CV · Jul 21, 2025

One Last Attention for Your Vision-Language Model

arXiv:2507.15480v21 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the adaptation of pretrained VLMs to downstream tasks. It offers a versatile fine-tuning technique: an incremental advance, but one that works whether the pretrained encoders are updated, kept frozen, or adapted at test time on unlabeled data.

The paper tackles the problem of fine-tuning vision-language models by proposing RAda, a method that dynamically calibrates fused representations to improve cross-modal interactions, achieving performance comparable to state-of-the-art methods with minimal modifications.

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods focus on refining representations from the separate modalities (text or vision) but neglect the critical role of their fused representation in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge this gap, we propose a simple yet effective Rational Adaptation (RAda) that explicitly exploits the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing the pretrained encoders during adaptation, and test-time training with access only to unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably to current state-of-the-art methods in most settings. Code is available at https://github.com/khufia/RAda/tree/main.
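
To make the mechanism in the abstract concrete, here is a minimal NumPy sketch of masking a rational matrix. It is an illustration under stated assumptions, not the authors' implementation. It assumes the rational matrix is the element-wise product of the image embedding with each class text embedding, so that its row sums are the usual CLIP-style similarity logits. The single-head self-attention over the rows, the "1 + attention output" form of the mask, and all function and weight names (rational_matrix, mask_from_attention, Wq, Wk, Wv) are hypothetical choices made for this sketch.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings, as is common for CLIP-style features.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rational_matrix(image_feat, text_feats):
    # ASSUMPTION: element (k, d) is the contribution of feature dimension d
    # to the logit of class k. Shape (K, D); summing a row gives that logit.
    return text_feats * image_feat[None, :]


def mask_from_attention(R, Wq, Wk, Wv):
    # HYPOTHETICAL lightweight attention: the K rows of R act as tokens.
    Q, K, V = R @ Wq, R @ Wk, R @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Wq.shape[1]), axis=-1)  # (K, K)
    # "1 + ..." means a zero value projection reproduces the unmasked baseline.
    return 1.0 + attn @ V  # (K, D), one weight per element of R


def calibrated_logits(image_feat, text_feats, Wq, Wk, Wv):
    R = rational_matrix(image_feat, text_feats)
    M = mask_from_attention(R, Wq, Wk, Wv)
    return (M * R).sum(axis=1)  # masked row sums are the adjusted logits


# Toy usage with random stand-ins for encoder outputs (K classes, D dims).
rng = np.random.default_rng(0)
K_classes, D = 3, 8
image_feat = l2_normalize(rng.normal(size=D))
text_feats = l2_normalize(rng.normal(size=(K_classes, D)))
Wq = rng.normal(size=(D, D)) * 0.1
Wk = rng.normal(size=(D, D)) * 0.1
Wv = np.zeros((D, D))  # zero init: the mask starts as all ones

baseline = text_feats @ image_feat
adjusted = calibrated_logits(image_feat, text_feats, Wq, Wk, Wv)
```

In this sketch only Wq, Wk, and Wv would be trained, and the encoders that produce image_feat and text_feats are untouched, which mirrors the abstract's point about avoiding modifications to intermediate features. With Wv at zero the adjusted logits equal the baseline, so training would start from the pretrained model's predictions. That initialization is a design choice of the sketch, not something the abstract states.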
