LG · May 7

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

arXiv:2605.06763 · 80.0

Predicted impact: top 15% in LG · last 90 days · Originality: highly original

AI Analysis

For LLM inference, this work provides a theoretically grounded, efficient KV cache index with recall guarantees, a dimension of sparse attention that the authors argue is critical and overlooked.

Sparse attention for LLMs can omit critical key-value entries, causing accuracy degradation. The authors propose Louver, a novel index structure that guarantees zero false negatives for relevant keys above a specified threshold. In their experiments, Louver outperforms prior sparse attention methods in both accuracy and runtime, and runs faster than optimized dense attention such as FlashAttention.

Sparse attention improves LLM inference efficiency by selecting a subset of key-value (KV) entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives during decoding, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing full recall of the relevant keys above a certain threshold. We address this challenge by reformulating sparse attention as the halfspace range searching problem. However, existing range searching indices are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval. Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU execution. Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense attention implementations such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.

