Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

arXiv:2602.00294v1 · 1.4 · h-index: 3

Originality: Highly original

AI Analysis

The work targets the infrastructure and energy demands of large-scale Transformer models by enabling unbounded token generation at a fixed cost per token.

The paper tackles the growth of self-attention's compute and memory costs with context length in Transformers. It reports constant cost per token and orders-of-magnitude reductions in memory and computation.

The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
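To make the idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It relies on a standard identity: the Taylor series of exp(q·k), truncated at a chosen order, equals an inner product of monomial features of q and k, with one feature per unique monomial because the tensor powers are symmetric. Causal attention then needs only two running sums, whose size does not depend on context length. The names `taylor_features`, `ConstantCostAttention`, and `softmax_attention` are hypothetical, queries are assumed to be pre-scaled, and the paper's own basis and transformations may differ.

```python
import itertools
import math
from collections import Counter


def taylor_features(x, order):
    """Monomial features phi(x) such that dot(phi(q), phi(k)) equals
    sum_{p=0..order} (q.k)^p / p!, the truncated Taylor series of exp(q.k)."""
    feats = []
    for p in range(order + 1):
        # Symmetry: one feature per unique degree-p monomial, not d**p entries.
        for idx in itertools.combinations_with_replacement(range(len(x)), p):
            # alpha! = product of factorials of each index's multiplicity.
            alpha_fact = math.prod(math.factorial(c) for c in Counter(idx).values())
            monomial = math.prod(x[i] for i in idx)
            feats.append(monomial / math.sqrt(alpha_fact))
    return feats


class ConstantCostAttention:
    """Causal attention with a fixed-size state (illustrative sketch only)."""

    def __init__(self, d_key, d_value, order):
        self.order = order
        n = len(taylor_features([0.0] * d_key, order))
        self.S = [[0.0] * d_value for _ in range(n)]  # sum of phi(k) v^T
        self.z = [0.0] * n                            # sum of phi(k)

    def step(self, q, k, v):
        # Fold the new key/value pair into the running sums.
        for i, f in enumerate(taylor_features(k, self.order)):
            self.z[i] += f
            for j, vj in enumerate(v):
                self.S[i][j] += f * vj
        # Read out with the query; the work here is independent of context length.
        phi_q = taylor_features(q, self.order)
        denom = sum(f * zi for f, zi in zip(phi_q, self.z))
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(len(v))
        ]


def softmax_attention(qs, ks, vs):
    """Reference causal softmax attention; cost grows with context length."""
    outs = []
    for t, q in enumerate(qs):
        w = [math.exp(sum(a * b for a, b in zip(q, k))) for k in ks[: t + 1]]
        total = sum(w)
        outs.append(
            [sum(wi * v[j] for wi, v in zip(w, vs[: t + 1])) / total
             for j in range(len(vs[0]))]
        )
    return outs
```

Feeding tokens one at a time through `ConstantCostAttention.step` and comparing against `softmax_attention` on the same sequence gives a simple correctness check. In this sketch the number of features grows combinatorially with key dimension and truncation order, so it illustrates the constant-state idea rather than an efficient implementation.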
