LG | AI | CV | Nov 26, 2025

Mechanisms of Non-Monotonic Scaling in Vision Transformers

arXiv:2511.21635v1 | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a counterintuitive scaling problem in vision transformers and offers computer vision researchers diagnostic tools and design insights.

The paper investigates why deeper Vision Transformers sometimes perform worse than shallower ones. It identifies a Cliff-Plateau-Climb pattern in how representations evolve with depth and shows that better performance correlates with reduced reliance on the [CLS] token in favor of distributed consensus among patch tokens. In ViT-L, the information-task tradeoff appears about 10 layers later than in ViT-B.

Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
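As a rough illustration of the two readouts the abstract contrasts (the [CLS] token versus a consensus over patch tokens), the sketch below extracts both from per-layer hidden states and compares them with cosine similarity. It is not the paper's Information Scrambling Index, whose definition is not reproduced here. The function names, the use of mean pooling as a stand-in for "distributed consensus", and the random input data are all assumptions made for illustration; the only architectural fact used is that token 0 is the [CLS] token.

```python
import numpy as np


def layerwise_readouts(hidden_states):
    """Split per-layer hidden states into a [CLS] readout and a patch readout.

    hidden_states: array of shape (num_layers, num_tokens, dim).
    Assumption: token 0 is [CLS]; the remaining tokens are patches.
    Mean pooling over patches is an illustrative choice, not the paper's method.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    cls_readout = hidden_states[:, 0, :]                   # (num_layers, dim)
    patch_readout = hidden_states[:, 1:, :].mean(axis=1)   # (num_layers, dim)
    return cls_readout, patch_readout


def cosine_per_layer(a, b, eps=1e-12):
    """Cosine similarity between matching rows of a and b, one value per layer."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


# Random stand-in data with ViT-B/16-like shapes at 224x224 input:
# 12 layers, 1 [CLS] + 196 patch tokens, 768 dimensions.
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 197, 768))

cls_r, patch_r = layerwise_readouts(states)
similarity = cosine_per_layer(cls_r, patch_r)
print(similarity.shape)  # one similarity value per layer
```

With real activations in place of the random array, one could fit a linear probe on each readout per layer and compare accuracies across depth; that comparison, again, is only one hypothetical way to examine the [CLS]-versus-patch question the abstract raises.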

