CV · LG · May 17, 2021

Vision Transformers are Robust Learners

arXiv:2105.07581v3 · 376 citations
Originality: Incremental advance
AI Analysis

This work addresses the robustness gap in vision models, showing that ViTs hold up better than comparable CNNs under corruptions, distribution shifts, and natural adversarial examples. Its contribution is incremental: it extends existing methods to robustness evaluation.

The study evaluated the robustness of Vision Transformers (ViTs) against common corruptions, distribution shifts, and natural adversarial examples, and found that ViTs are more accurate than comparable convolutional neural networks (BiT). For example, ViT reaches 28.10% top-1 accuracy on ImageNet-A, 4.3x higher than a comparable BiT variant.
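To make the headline metric concrete, here is a minimal sketch of how top-1 accuracy and an "N times higher" factor are typically computed. The function names `top1_accuracy` and `improvement_factor` and the toy scores are illustrative assumptions, not taken from the paper or its code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def improvement_factor(acc_a: float, acc_b: float) -> float:
    """How many times higher acc_a is than acc_b (a ratio, not a difference)."""
    if acc_b <= 0:
        raise ValueError("baseline accuracy must be positive")
    return acc_a / acc_b


# Toy scores for 4 samples over 3 classes (made-up values, not from the paper).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
toy_labels = [1, 0, 2, 2]
toy_acc = top1_accuracy(toy_scores, toy_labels)
```

Note that a "4.3x higher" claim is a ratio of two accuracies, so it says nothing about the absolute gap without one of the two values.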

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, as shown by recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is the evaluation and attribution of their robustness. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets for robust classification to conduct a comprehensive performance comparison between ViT models and SOTA convolutional neural networks (CNNs), namely Big Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications of why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT achieves a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than that of a comparable BiT variant. Our analyses of image masking, Fourier spectrum sensitivity, and the spread of the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.
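As a rough illustration of what an image masking analysis involves, the sketch below zeroes out a random subset of square patches in a grayscale image. It is a dependency-free approximation under stated assumptions: the helper `mask_random_patches`, the patch size, the masking fraction, and the zero fill value are all hypothetical choices and do not reproduce the authors' protocol.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def mask_random_patches(image: Image, patch: int, fraction: float, seed: int = 0) -> Image:
    """Return a copy of `image` with about `fraction` of its patch grid zeroed."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # Top-left corners of all non-overlapping patches.
    grid = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    n_masked = round(fraction * len(grid))
    rng = random.Random(seed)  # seeded for reproducibility
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r0, c0 in rng.sample(grid, n_masked):
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                masked[r][c] = 0.0
    return masked


# Toy 8x8 image of ones; mask half of its four 4x4 patches.
toy_image = [[1.0] * 8 for _ in range(8)]
toy_masked = mask_random_patches(toy_image, patch=4, fraction=0.5)
```

In an evaluation loop, one would feed masked images to each model at increasing fractions and compare how quickly accuracy degrades.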

Code Implementations: 1 repo