CV · AI · LG · Jun 28, 2022

Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment

arXiv:2206.13951v1 · 36 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses robustness issues in Vision Transformers for image classification tasks, offering a practical solution for real-world deployment without costly retraining, though it is incremental as it builds on existing test-time adaptation techniques.

The paper tackles the problem of improving Vision Transformer robustness to corruptions and domain shifts without retraining by proposing a test-time adaptation method called class-conditional feature alignment (CFA), which achieves a state-of-the-art top-1 error rate of 19.8% on ImageNet-C, outperforming baselines by a large margin.

Vision Transformer (ViT) is becoming increasingly popular in image processing. We investigate the effectiveness of test-time adaptation (TTA) on ViT, a technique in which a model corrects its own predictions at test time. First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16. We show that TTA is effective on ViT and that the prior convention (sensibly selecting modulation parameters) is not necessary when a proper loss function is used. Based on this observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes both the class-conditional distribution differences and the whole-distribution differences of the hidden representation between the source and target in an online manner. Experiments on image classification tasks covering common corruptions (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA consistently outperforms the existing baselines across datasets. We also verify that CFA is model agnostic by experimenting on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT). Using a BeiT backbone, CFA achieves a 19.8% top-1 error rate on ImageNet-C, outperforming the existing test-time adaptation baseline of 44.0%. This is a state-of-the-art result among TTA methods that do not need to alter the training phase.
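
To make the idea of aligning both the whole distribution and the class-conditional distributions more concrete, here is a minimal sketch in plain Python. It is an illustrative simplification, not the authors' loss: it matches only feature means (first moments), and the names `alignment_loss`, `source_mean`, and `source_class_means` are hypothetical. It assumes source statistics were computed ahead of time and that target samples are grouped by the model's own predicted (pseudo) labels.

```python
from collections import defaultdict


def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(target_feats, pseudo_labels, source_mean, source_class_means):
    """Toy alignment loss: global mean gap plus the average per-class mean gap.

    Assumption: only first moments are matched; this is a simplified
    stand-in for a distribution-difference measure.
    """
    # Whole-distribution term: compare the target batch mean with the source mean.
    global_term = squared_distance(mean_vector(target_feats), source_mean)

    # Class-conditional term: group target features by predicted (pseudo) label.
    by_class = defaultdict(list)
    for feat, label in zip(target_feats, pseudo_labels):
        by_class[label].append(feat)

    # Compare each class mean in the batch with the stored source class mean.
    class_terms = [
        squared_distance(mean_vector(feats), source_class_means[label])
        for label, feats in by_class.items()
        if label in source_class_means
    ]
    class_term = sum(class_terms) / len(class_terms) if class_terms else 0.0
    return global_term + class_term


# Hypothetical two-class source statistics in a 2-D feature space.
source_mean = [0.0, 0.0]
source_class_means = {0: [-1.0, 0.0], 1: [1.0, 0.0]}

aligned = alignment_loss([[-1.0, 0.0], [1.0, 0.0]], [0, 1], source_mean, source_class_means)
shifted = alignment_loss([[0.0, 0.0], [2.0, 0.0]], [0, 1], source_mean, source_class_means)
print(aligned, shifted)
```

In an online test-time setting, a loss of this kind would be evaluated on each incoming target batch and used to update the model; the sketch covers only the loss computation, where a shifted batch scores higher than an aligned one.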

Code Implementations: 1 repo