Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
This work addresses the need for robust adaptation of large VLMs in resource-constrained settings and offers a practical solution for few-shot scenarios, though the contribution is incremental because it builds on existing PEFT and adversarial training methods.
The paper tackles adversarial vulnerability in few-shot fine-tuning of vision-language models such as CLIP with LoRA. It proposes AdvCLIP-LoRA, which improves robustness through minimax optimization and achieves state-of-the-art performance in few-shot classification and adversarial generalization across eight datasets with minimal loss of clean accuracy.
Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, in which imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, to our knowledge the first method designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates training as a minimax optimization over low-rank adapters and adversarial perturbations, enabling robust adaptation with a small trainable footprint. Across eight datasets and two backbones (ViT-B/16 and ViT-B/32), AdvCLIP-LoRA achieves state-of-the-art performance in few-shot classification, adversarial base-to-new generalization, and cross-dataset transfer, delivering higher adversarial robustness than prompt-tuning baselines without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical approach for robust adaptation of VLMs in resource-constrained settings.
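To make the minimax formulation concrete, the following is a minimal sketch of adversarial training over low-rank adapters. It is not the AdvCLIP-LoRA algorithm itself. It assumes a toy linear classifier with a frozen weight W0 standing in for CLIP, NumPy with hand-derived gradients, and a standard L-infinity PGD attack for the inner maximization, none of which are stated in the abstract. All function names and hyperparameters (make_toy_data, pgd_attack, train_adversarial_lora, eps, lr) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grads(W0, A, B, X, Y):
    """Cross-entropy for logits X @ (W0 + B @ A).T, with gradients w.r.t. A, B and X."""
    W = W0 + B @ A                        # frozen weight plus low-rank update
    P = softmax(X @ W.T)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    G = (P - Y) / X.shape[0]              # gradient of the mean loss w.r.t. the logits
    dW = G.T @ X
    return loss, B.T @ dW, dW @ A.T, G @ W   # loss, dL/dA, dL/dB, dL/dX

def pgd_attack(W0, A, B, X, Y, eps, step, iters):
    """Inner maximization: signed-gradient ascent on delta inside the L-inf ball."""
    delta = np.zeros_like(X)
    for _ in range(iters):
        _, _, _, dX = loss_and_grads(W0, A, B, X + delta, Y)
        delta = np.clip(delta + step * np.sign(dX), -eps, eps)
    return delta

def train_adversarial_lora(W0, X, Y, rank=2, eps=0.1, lr=0.2, epochs=300, seed=0):
    """Outer minimization: gradient descent on the adapters A and B only; W0 stays frozen."""
    rng = np.random.default_rng(seed)
    c, d = W0.shape
    A = rng.normal(scale=0.1, size=(rank, d))  # common LoRA-style init: A random, B zero
    B = np.zeros((c, rank))
    for _ in range(epochs):
        delta = pgd_attack(W0, A, B, X, Y, eps, step=eps / 4, iters=5)
        _, dA, dB, _ = loss_and_grads(W0, A, B, X + delta, Y)
        A -= lr * dA
        B -= lr * dB
    return A, B

def make_toy_data(n=40, d=4, seed=0):
    """Two classes separated along the first feature (illustrative data only)."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % 2
    X = 0.3 * rng.normal(size=(n, d))
    X[:, 0] += np.where(labels == 0, 1.0, -1.0)
    return X, np.eye(2)[labels]
```

In this sketch only A and B are updated, so the trainable footprint is rank * (c + d) parameters instead of the c * d entries of the full weight, which mirrors the small-footprint argument made above. The inner loop here fully re-solves the attack at every step, which is a design choice of the sketch and should not be read as a description of how the paper alternates its updates.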