CVJul 13, 2025

Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

arXiv:2507.09615v1h-index: 1
Originality Incremental advance
AI Analysis

This work addresses fine-grained classification challenges in computer vision, offering an incremental improvement over existing methods.

The paper tackled the problem of fine-grained unsupervised adaptation for vision-language models like CLIP by introducing FAIR, which dynamically aligns image and text features to improve pseudo-labeling, resulting in a 2.78% overall gain across 13 datasets.

Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes