DS AIMay 7

Nearly Optimal Attention Coresets

Edo Liberty, Alexandr Andoni, Eldar Kleiner

arXiv:2605.0560258.4

AI Analysis

This provides a theoretical foundation for efficient attention computation, benefiting large-scale transformer models by reducing memory and computational costs.

The paper proves the existence of nearly optimal coresets for attention mechanisms, achieving size O(√d e^{ρ+o(ρ)}/ε) for approximating attention for all bounded-norm queries, with a matching lower bound of Ω(√d e^ρ/ε).

We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O({\sqrt{d} e^{ρ+o(ρ)}/\varepsilon})$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $ρ$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $Ω({\sqrt{d} e^ρ/ε})$.

View on arXiv PDF

Similar