CV | LG | ML | Mar 5

Layer by layer, module by module: Choose both for optimal OOD probing of ViT

arXiv:2603.05280v1
Originality: Incremental advance
AI Analysis

This research helps practitioners and researchers understand and optimize the use of intermediate layers in vision transformers for out-of-distribution tasks, particularly in image classification.

This paper investigates why intermediate layers of vision transformers often produce more discriminative representations than the final layer, and finds that distribution shift between pretraining and downstream data is the main cause of performance degradation in deeper layers. The authors also show that probing the activation within the feedforward network works best under significant distribution shift, while probing the normalized output of the multi-head self-attention module is optimal when the shift is weak.

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
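To make the module-level probe points concrete, here is a minimal NumPy sketch of a single pre-norm transformer block that exposes three candidate features for a linear probe. It is an illustration under stated assumptions, not the authors' implementation: the function names (`init_block`, `block_probe_points`, `pool_tokens`), the random weights, the affine-free layer norm, and the mean-pooling over tokens are all hypothetical. In particular, "normalized output of the multi-head self-attention module" is read here as the layer-normalized residual stream right after the attention sub-layer, which is one plausible interpretation of the abstract.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Layer norm without learned scale/shift (simplifying assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_block(dim, hidden, rng, scale=0.02):
    # Random weights standing in for a pretrained block (hypothetical).
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, hidden), "w2": (hidden, dim)}
    return {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}

def block_probe_points(x, p, num_heads):
    """x has shape (tokens, dim); returns per-token features at three probe points."""
    n, d = x.shape
    hd = d // num_heads

    def split_heads(t):
        return t.reshape(n, num_heads, hd).transpose(1, 0, 2)  # (heads, tokens, hd)

    # Multi-head self-attention sub-layer with residual connection.
    h = layer_norm(x)
    q, k, v = (split_heads(h @ p[w]) for w in ("wq", "wk", "wv"))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))
    mhsa = (attn @ v).transpose(1, 0, 2).reshape(n, d) @ p["wo"]
    x = x + mhsa

    # Assumed reading of "normalized MHSA output": LN of the post-attention stream.
    mhsa_norm = layer_norm(x)

    # Feedforward sub-layer; the hidden activation is the FFN probe point.
    ffn_act = gelu(mhsa_norm @ p["w1"])
    block_out = x + ffn_act @ p["w2"]

    return {"mhsa_norm": mhsa_norm, "ffn_act": ffn_act, "block_out": block_out}

def pool_tokens(feats):
    # Mean-pool over tokens to get one vector per probe point (assumption).
    return {name: f.mean(axis=0) for name, f in feats.items()}

rng = np.random.default_rng(0)
params = init_block(dim=8, hidden=32, rng=rng)
tokens = rng.normal(size=(5, 8))  # 5 tokens with 8 channels each
features = block_probe_points(tokens, params, num_heads=2)
pooled = pool_tokens(features)
```

In a probing study along the lines described above, each pooled vector would be fed to a separate linear classifier per layer and per probe point, and the resulting accuracies compared across downstream datasets with different degrees of shift from the pretraining data.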
