CV · Jan 2, 2025

SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

arXiv:2501.01529v1 · 3 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the vulnerability of vision transformers to adversarial perturbations, which matters for robust computer vision applications, though it is an incremental improvement over existing fine-tuning techniques.

The paper tackles adversarial overfitting in vision transformers by introducing SAFER, a layer-selective fine-tuning method that improves both clean and adversarial accuracy, with typical gains around 5% and up to 20% in some cases.

Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
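The core mechanic described in the abstract, a sharpness-aware update applied to a few chosen layers while the rest stay frozen, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The toy quadratic loss, the layer names, the `curvature` table, the function names, and the hyperparameters `lr` and `rho` are all assumptions made for illustration. The set of selected layers is taken as given, because the abstract does not state the selection criterion, and the adversarial-training loss that the paper targets is replaced here by a simple stand-in.

```python
import math

def loss_and_grads(params, curvature):
    """Toy stand-in loss: sum over layers of 0.5 * c * ||w||^2, with analytic gradients."""
    loss = 0.0
    grads = {}
    for name, w in params.items():
        c = curvature[name]
        loss += 0.5 * c * sum(x * x for x in w)
        grads[name] = [c * x for x in w]
    return loss, grads

def sam_step_selected(params, curvature, selected, lr=0.1, rho=0.05):
    """One sharpness-aware step on the `selected` layers; all other layers are frozen."""
    _, grads = loss_and_grads(params, curvature)

    # Gradient norm restricted to the selected layers.
    norm = math.sqrt(sum(g * g for n in selected for g in grads[n]))
    if norm == 0.0:
        return {n: list(w) for n, w in params.items()}

    # Ascent: move the selected layers to the nearby worst-case point w + rho * g / ||g||.
    perturbed = {
        n: [x + rho * g / norm for x, g in zip(w, grads[n])] if n in selected else list(w)
        for n, w in params.items()
    }
    _, sharp_grads = loss_and_grads(perturbed, curvature)

    # Descent: update the ORIGINAL selected weights with the gradient taken at the perturbed point.
    return {
        n: [x - lr * g for x, g in zip(w, sharp_grads[n])] if n in selected else list(w)
        for n, w in params.items()
    }

# Hypothetical three-"layer" model; only "block1" is fine-tuned.
params = {"block0": [1.0, -2.0], "block1": [0.5, 0.5], "head": [3.0]}
curvature = {"block0": 1.0, "block1": 4.0, "head": 0.5}
selected = {"block1"}

loss_before, _ = loss_and_grads(params, curvature)
updated = sam_step_selected(params, curvature, selected)
loss_after, _ = loss_and_grads(updated, curvature)
print(loss_before, loss_after)
```

In a real ViT setting the same two-pass pattern would presumably run on the parameters of the chosen transformer blocks, with gradients from an adversarial loss and every other parameter excluded from the update.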

