$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones
This work addresses the trade-off between localization quality and representation richness in ViTs for robust visual classification, offering a simple, training-free method that improves performance under distribution shifts.
The authors find that smaller self-supervised Vision Transformers (ViTs) produce attention maps that localize foreground objects better than those of larger ViTs. They propose $A^2$, a method that uses a small ViT for localization and a large ViT for feature extraction, achieving competitive or superior performance on 5 benchmarks without per-dataset attention or backbone training.
Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
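The decoupling of "where to look" from "what to extract" can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention peak (the abstract speaks of peaks in the plural), a square crop of fixed size, and two hypothetical stand-in functions, `toy_small_attention` and `toy_large_embed`, in place of the pretrained small and large ViTs. All function names and parameters below are illustrative.

```python
import numpy as np


def crop_around_attention_peak(image, attn, crop_size):
    """Crop a square window centered on the strongest attention patch.

    image: (H, W, C) array.
    attn:  (H_p, W_p) patch-level attention map from the small model.
    Assumption: one peak and a fixed crop size; the paper may differ.
    """
    h, w = image.shape[:2]
    crop_size = min(crop_size, h, w)
    # Pixel extent of one attention patch along each axis.
    ph, pw = h // attn.shape[0], w // attn.shape[1]
    # Patch index with the highest attention.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Center of that patch in pixel coordinates.
    cy, cx = py * ph + ph // 2, px * pw + pw // 2
    # Clamp the window so it stays inside the image.
    top = int(np.clip(cy - crop_size // 2, 0, h - crop_size))
    left = int(np.clip(cx - crop_size // 2, 0, w - crop_size))
    return image[top:top + crop_size, left:left + crop_size]


def a2_embed(image, small_attention_fn, large_embed_fn, crop_size=32):
    """Locate with the small model, then embed the crop with the large one."""
    attn = small_attention_fn(image)
    crop = crop_around_attention_peak(image, attn, crop_size)
    return large_embed_fn(crop)


def toy_small_attention(image, patch_size=16):
    """Hypothetical stand-in for a small ViT: mean brightness per patch."""
    h, w, c = image.shape
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    return grid.mean(axis=(1, 3, 4))


def toy_large_embed(crop):
    """Hypothetical stand-in for a large ViT: per-channel mean of the crop."""
    return crop.mean(axis=(0, 1))


# Usage: a dark image with one bright 16x16 "object".
image = np.zeros((64, 64, 3))
image[32:48, 16:32] = 1.0
embedding = a2_embed(image, toy_small_attention, toy_large_embed, crop_size=32)
```

In practice the two stand-ins would be replaced by a pretrained small ViT's attention map and a pretrained large ViT's embedding. The point of the structure is that the two models only communicate through the crop, so neither needs to be trained on the target dataset.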