LGAICLFeb 6

The Condensate Theorem: Transformers are O(n), Not $O(n^2)$

arXiv:2602.06317v2h-index: 1
Originality Highly original
AI Analysis

This addresses the scalability problem for large language model inference by eliminating the quadratic bottleneck, potentially reducing costs by >99.9%.

The paper tackles the quadratic computational bottleneck in transformer attention by proving that attention sparsity is a learned topological property, enabling lossless compression to linear complexity with 100% output equivalence, validated by a 159x speedup at 131K tokens and projected >1,200x speedup at 1M tokens.

We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold -- and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation -- it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes