CV, Feb 25

Vision Transformers Need More Than Registers

arXiv:2602.22394v1 | 8 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses performance artifacts in Vision Transformers for computer vision applications and represents an incremental improvement.

The paper identifies that Vision Transformers (ViTs) exhibit artifacts due to lazy aggregation behavior where they use irrelevant background patches as shortcuts, and proposes a solution that selectively integrates patch features into the CLS token to improve performance across 12 benchmarks under various supervision types.

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViTs use semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
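The abstract does not say how patches are selected, so the following is only an illustrative sketch of what "selectively integrating patch features into the CLS token" could look like. It assumes a top-k rule based on cosine similarity to the current CLS token and a fixed blending weight. The function name `selective_cls_update` and the parameters `k` and `alpha` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def selective_cls_update(cls_token, patches, k=2, alpha=0.5):
    """Blend the CLS token with the mean of its k most similar patches.

    ASSUMPTION: top-k cosine selection and linear blending are stand-ins
    chosen for illustration, not the paper's stated selection rule.
    Returns the updated CLS vector and the sorted indices of kept patches.
    """
    k = min(k, len(patches))
    if k == 0:
        return list(cls_token), []

    # Score every patch against the CLS token and keep the k best.
    scores = [cosine(cls_token, p) for p in patches]
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]

    # Average only the selected patches; the rest are ignored entirely.
    dim = len(cls_token)
    pooled = [sum(patches[i][d] for i in top) / k for d in range(dim)]

    # Linear blend between the old CLS token and the pooled patch feature.
    new_cls = [(1 - alpha) * c + alpha * m for c, m in zip(cls_token, pooled)]
    return new_cls, sorted(top)


if __name__ == "__main__":
    cls = [1.0, 0.0]
    patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    updated, kept = selective_cls_update(cls, patches, k=2, alpha=0.5)
    print(kept, updated)
```

In this toy input the last two patches play the role of background: they are dissimilar to the CLS token and are excluded from the pooled feature, which is the intuition behind limiting background-dominated shortcuts.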

