CV | Sep 4, 2025

Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

arXiv:2509.03895v1 | h-index: 6
Originality: Highly original
AI Analysis

This work addresses the problem of computationally intensive fine-tuning for few-shot scenarios in vision-language models, offering a more efficient solution for researchers and practitioners.

The paper tackles few-shot learning for vision-language models by proposing Attn-Adapter, an online framework that improves CLIP's adaptability without retraining the base model and yields better cross-category and cross-dataset generalization.

Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
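The dual attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention without learned projections, the residual weight `alpha`, the random stand-in embeddings, and all function and variable names are assumptions made for illustration only. It shows category embeddings attending over support-image embeddings (the role described for the Memory Attn-Adapter) and a global image embedding attending over local patch features (the role described for the Local-Global Attn-Adapter), followed by a CLIP-style similarity score.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_attention_refine(queries, keys, values, alpha=0.5):
    """Residual single-head cross-attention (hypothetical, no learned weights).

    Each query is updated with an attention-weighted mix of the values,
    then re-normalized so cosine similarity remains meaningful.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    refined = queries + alpha * (weights @ values)
    return l2_normalize(refined)

# Random stand-ins for frozen CLIP outputs (assumed shapes, not real features).
rng = np.random.default_rng(0)
num_classes, shots, num_patches, dim = 3, 4, 16, 8
text_emb = l2_normalize(rng.normal(size=(num_classes, dim)))             # category embeddings
support_emb = l2_normalize(rng.normal(size=(num_classes * shots, dim)))  # few-shot support images
global_emb = l2_normalize(rng.normal(size=(1, dim)))                     # query image, global feature
patch_emb = l2_normalize(rng.normal(size=(num_patches, dim)))            # query image, local features

# Memory-style step: category embeddings attend over support examples.
class_emb = cross_attention_refine(text_emb, support_emb, support_emb)

# Local-global-style step: the global image feature attends over its patches.
image_emb = cross_attention_refine(global_emb, patch_emb, patch_emb)

# CLIP-style classification by cosine similarity of the refined embeddings.
logits = image_emb @ class_emb.T
probs = softmax(logits)
print(probs.shape)
```

In this sketch the base encoders are never updated; only the embeddings they produce are refined at inference time, which mirrors the abstract's claim of adaptation without retraining the base model.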

