CV · Feb 26

Infinite Self-Attention

arXiv:2603.00175v1 · 1.5 · h-index: 15

Originality: Highly original

AI Analysis

This work significantly improves the scalability and efficiency of Vision Transformers for high-resolution image processing by offering a linear-time attention mechanism that avoids out-of-memory issues at very large input sizes.

This paper introduces Infinite Self-Attention (InfSA), a spectral reformulation of self-attention that addresses the quadratic cost of softmax attention in Transformers. The proposed Linear-InfSA variant supports stable training at 4096x4096 and inference at 9216x9216 (about 332k tokens). In a 4-layer ViT it reaches 84.7% top-1 on ImageNet-1K (+3.2 points over an equal-depth softmax ViT trained with the same recipe), and InfViT variants outperform all compared baselines on ImageNet-V2. On an A100 40GB GPU, Linear-InfViT also delivers 13x better throughput and energy efficiency than an equal-depth ViT.
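The 332k token figure is consistent with a 16-pixel patch grid, which is an assumption here because the summary does not state the patch size. The Python sketch below checks that arithmetic and estimates the size of one dense N x N attention matrix at 16-bit precision (also an assumption), to illustrate why quadratic attention tends to run out of memory at this resolution. The function names are hypothetical and are not taken from the paper.

```python
def token_count(side_px: int, patch_px: int = 16) -> int:
    """Number of patch tokens for a square image.

    patch_px=16 is an assumed patch size, not a value stated in the source.
    """
    per_side = side_px // patch_px
    return per_side * per_side


def dense_attention_gib(n_tokens: int, bytes_per_elem: int = 2) -> float:
    """Memory in GiB for one dense N x N attention matrix.

    bytes_per_elem=2 assumes 16-bit storage; this counts a single head
    in a single layer and ignores all other activations.
    """
    return n_tokens * n_tokens * bytes_per_elem / 2**30


if __name__ == "__main__":
    for side in (224, 4096, 9216):
        n = token_count(side)
        print(f"{side}x{side}: {n} tokens, "
              f"{dense_attention_gib(n):.5f} GiB per dense attention matrix")
```

Under these assumptions, 9216x9216 gives 576 x 576 = 331,776 tokens, and a single dense attention matrix would need about 205 GiB, far more than a 40 GB accelerator provides. A method whose state does not grow with N avoids materializing that matrix at all.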

The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to the per-head dimension d_h (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
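The link between the discounted Neumann series and the absorbing Markov chain can be checked numerically on a toy example. The sketch below assumes a single fixed row-stochastic attention matrix A and a scalar discount gamma in (0, 1); the abstract does not give the exact form used in the paper, so this illustrates the general identity sum_{k>=0} (gamma A)^k = (I - gamma A)^{-1} and is not the authors' implementation. All function names are hypothetical, and reading centrality off the column sums of the kernel is one illustrative choice.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax, giving a row-stochastic 'attention' matrix."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Dense matrix product for small list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(a, gamma, hops):
    """Truncated series sum_{k=0}^{hops} (gamma * a)^k."""
    n = len(a)
    total = identity(n)
    term = identity(n)
    for _ in range(hops):
        # term becomes (gamma * a)^k for the next hop count k
        term = [[gamma * x for x in row] for row in matmul(term, a)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def demo(n_tokens=6, gamma=0.5, hops=60, seed=0):
    rng = random.Random(seed)
    scores = [[rng.gauss(0.0, 1.0) for _ in range(n_tokens)]
              for _ in range(n_tokens)]
    attn = softmax_rows(scores)
    kernel = neumann_sum(attn, gamma, hops)

    # If kernel ~= (I - gamma A)^{-1}, then (I - gamma A) @ kernel ~= I.
    eye = identity(n_tokens)
    i_minus = [[eye[i][j] - gamma * attn[i][j] for j in range(n_tokens)]
               for i in range(n_tokens)]
    prod = matmul(i_minus, kernel)
    residual = max(abs(prod[i][j] - eye[i][j])
                   for i in range(n_tokens) for j in range(n_tokens))

    # Row i: expected total visits of a walk started at token i that
    # continues with probability gamma at each step (absorbed otherwise).
    row_sums = [sum(row) for row in kernel]
    # Column j: expected visits to token j, summed over all start tokens.
    centrality = [sum(kernel[i][j] for i in range(n_tokens))
                  for j in range(n_tokens)]
    return residual, row_sums, centrality


if __name__ == "__main__":
    residual, row_sums, centrality = demo()
    print("max residual:", residual)
    print("row sums:", [round(s, 6) for s in row_sums])
    print("centrality:", [round(c, 4) for c in centrality])
```

Because A is row-stochastic, every row of the kernel sums to 1 / (1 - gamma), the expected walk length before absorption, while the column sums differ per token and act as a Katz-style centrality score. The sketch forms the full N x N matrix and is therefore quadratic in N; it does not reproduce the linear-time variant described above.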
