LG · Jan 26

Superlinear Multi-Step Attention

arXiv:2601.18401v1

Originality: Incremental advance

AI Analysis

The paper addresses the quadratic scaling of transformer attention, a problem for researchers and practitioners working with extremely long sequences. It is an incremental architectural improvement rather than a paradigm shift.

The paper tackles the computational bottleneck of standard attention on long sequences by proposing Superlinear attention, a multi-step architecture that reformulates causal self-attention as an N-step search. This yields subquadratic complexity of O(L^{1+1/N}) while preserving random context access. In an upscaled O(L^{1.54}) configuration, the authors' implementation reports an average decoding throughput of 114 tokens/sec at 1M context length and 80 tokens/sec at 10M context on a modified 30B hybrid MoE model running on a single B200 GPU.
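To make the scaling claim concrete, the short sketch below computes the exponent 1 + 1/N and the ratio L^{1+1/N} / L^2 for a given context length. The function names are illustrative, not taken from the paper, and the ratio ignores constant factors, so it indicates asymptotic scaling only and is not a speedup prediction.

```python
def superlinear_exponent(n_steps: int) -> float:
    """Exponent of L in the stated O(L^(1 + 1/N)) complexity."""
    return 1.0 + 1.0 / n_steps


def cost_ratio_vs_quadratic(seq_len: int, n_steps: int) -> float:
    """Ratio L^(1 + 1/N) / L^2, ignoring all constant factors."""
    return float(seq_len) ** (superlinear_exponent(n_steps) - 2.0)


if __name__ == "__main__":
    for n in (2, 3, 4):
        exponent = superlinear_exponent(n)
        ratio = cost_ratio_vs_quadratic(1_000_000, n)
        print(f"N={n}: exponent={exponent:.3f}, ratio vs L^2 at L=1M: {ratio:.2e}")
```

For N = 2 the exponent is 3/2, which matches the baseline instantiation described in the abstract.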

In this paper, we propose Superlinear attention, a fully trainable multi-step attention architecture that achieves subquadratic complexity for long sequences while preserving random context access (a.k.a. structural non-exclusion): no eligible token position is structurally excluded from being selected for attention. Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps, yielding an overall complexity of $O(L^{1+\frac{1}{N}})$. To illustrate the architecture, we present a baseline $N=2$ implementation, which is algorithmically analogous to standard jump search. In this $O(L^{3/2})$ instantiation, the first step performs $O(L^{3/2})$ span-search to select relevant spans of the sequence, and the second step applies $O(L^{3/2})$ span-attention (standard attention restricted to the selected spans). In an upscaled $O(L^{1.54})$ configuration for robustness, we achieve an average decoding throughput of 114 tokens/sec at 1M context length and 80 tokens/sec at 10M context in our implementation on a modified 30B hybrid MoE model on a single B200 GPU. With limited training, we also obtain strong performance on the NIAH (Needle In A Haystack) task up to 256K context length, demonstrating that the routed span selection is learnable end-to-end. This paper emphasizes architectural formulation, scaling analysis, and systems feasibility, and presents initial validation; comprehensive quality evaluations across diverse long-context tasks are left to future work.
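The N = 2 idea in the abstract (span-search followed by span-attention) can be illustrated with a toy single-query decode step. This is only a sketch under stated assumptions, not the authors' implementation. It assumes spans of about sqrt(L) tokens, a mean-of-keys span summary standing in for the learned routing, and a hypothetical `top_k` parameter for the number of selected spans. None of these details come from the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_step_attention(q, keys, values, top_k=2):
    """Toy jump-search-style attention for one query over cached keys/values."""
    seq_len, dim = len(keys), len(q)
    span = max(1, math.isqrt(seq_len))  # assumed span size of about sqrt(L)

    # Step 1 (span-search): score one summary per span, about sqrt(L) scores.
    # Assumption: the summary is the mean key. It is recomputed here for
    # brevity; a practical system would cache it.
    summaries = []
    for start in range(0, seq_len, span):
        block = keys[start:start + span]
        summaries.append([sum(col) / len(block) for col in zip(*block)])
    scores = [dot(q, s) for s in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])

    # Step 2 (span-attention): softmax attention restricted to chosen spans.
    idx = [t for c in chosen
           for t in range(c * span, min((c + 1) * span, seq_len))]
    logits = [dot(q, keys[t]) / math.sqrt(dim) for t in idx]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[t][j] for w, t in zip(weights, idx)) / total
           for j in range(len(values[0]))]
    return out, idx


if __name__ == "__main__":
    keys = [[0.0, 0.0] for _ in range(16)]
    keys[9] = [5.0, 0.0]  # the "needle"
    values = [[float(t)] for t in range(16)]
    out, idx = two_step_attention([1.0, 0.0], keys, values, top_k=1)
    print(idx, out)
```

Any span can be selected in step 1, so no position is excluded by construction. That is the property the abstract calls random context access. With both the number of spans and the span length near sqrt(L), each query does roughly sqrt(L) work per step once the summaries are cached, which over L queries gives the O(L^{3/2}) total.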
