CV · Feb 26

Infinite Self-Attention

arXiv:2603.00175v1 · h-index: 2
Originality: Highly original
AI Analysis

This work improves the scalability and efficiency of Vision Transformers for high-resolution image processing by offering a linear-time attention mechanism that completes inference at 9216x9216, where the other tested models run out of memory.

This paper introduces Infinite Self-Attention (InfSA), a spectral reformulation of self-attention that addresses the quadratic cost of softmax attention in Transformers. The proposed Linear-InfSA variant supports stable training at 4096x4096 and inference at 9216x9216 (about 332k tokens), reaching 84.7% top-1 on ImageNet-1K (+3.2 points over an equal-depth softmax ViT) and outperforming all compared baselines on ImageNet-V2. On an A100 40GB GPU it also delivers 13x better throughput and energy efficiency than an equal-depth ViT.

The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension d_h (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
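
As a rough illustration of the Neumann-series view described in the abstract, the sketch below builds a small random row-stochastic attention matrix A, accumulates the truncated discounted series sum over k = 0..K of gamma^k A^k, and reads off a per-token centrality as the expected number of visits before absorption. It is a toy under stated assumptions, not the paper's implementation. The discount gamma = 0.5, the truncation depth, the uniform start distribution, the reading of the chain as "continue with probability gamma, otherwise absorb", and the names `softmax_rows`, `matmul`, `neumann_kernel` and `token_centrality` are all hypothetical choices made for this example. The sketch forms the full N-by-N matrix, so it corresponds to the quadratic formulation rather than to Linear-InfSA.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax, so every row of the result sums to 1."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain dense matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def neumann_kernel(attn, gamma, hops):
    """Truncated discounted series: sum over k = 0..hops of gamma^k * attn^k."""
    n = len(attn)
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    kernel = [row[:] for row in term]  # the k = 0 term is the identity
    for _ in range(hops):
        # term becomes gamma^k * attn^k for the next k
        term = [[gamma * x for x in row] for row in matmul(term, attn)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel


def token_centrality(kernel):
    """Expected visits to each token before absorption (uniform start token)."""
    n = len(kernel)
    return [sum(kernel[i][j] for i in range(n)) / n for j in range(n)]


# Assumed toy setup: 6 tokens, random scores, gamma = 0.5, 60 hops.
random.seed(0)
n_tokens, gamma, hops = 6, 0.5, 60
scores = [[random.gauss(0.0, 1.0) for _ in range(n_tokens)]
          for _ in range(n_tokens)]
attn = softmax_rows(scores)

kernel = neumann_kernel(attn, gamma, hops)
centrality = token_centrality(kernel)
print([round(c, 3) for c in centrality])
```

Because each row of A sums to 1, each row of the kernel sums to approximately 1 / (1 - gamma), the expected number of steps before absorption in this toy chain. The kernel also satisfies S = I + gamma A S up to truncation error, which is the fixed-point form of the matrix inverse (I - gamma A)^(-1). Both give a quick sanity check on the series.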
