LG, May 27

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

arXiv:2605.28640

AI Analysis

For researchers working on efficient long-context LLM inference, this work demonstrates a simple augmentation that boosts the accuracy of existing sparse attention methods, though the gains are incremental.

The paper shows that augmenting attention with an exponentially decaying memory module (RAT+) consistently improves accuracy over standard attention when used with query-aware sparse inference methods (Quest, MoBA, SnapKV) across eight needle-in-a-haystack tasks. The gains are validated on the released RAT+ checkpoints and on OLMo2-7B after 10B tokens of continued pretraining with the added memory module.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
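The two ingredients named in the abstract can be illustrated with a toy sketch. This is not the RAT+ architecture or the Quest, MoBA, or SnapKV algorithms; it assumes a plain exponential moving average as the decaying memory and a simplified block selector that scores each key block by the query's dot product with the block's mean key. The function names `decayed_memory` and `select_blocks` and the parameters `decay`, `block_size`, and `budget` are hypothetical, chosen only for illustration.

```python
from typing import List

Vector = List[float]


def decayed_memory(xs: List[Vector], decay: float) -> List[Vector]:
    """Toy exponentially decaying memory (an assumption, not the RAT+ module).

    Computes m_t = decay * m_{t-1} + (1 - decay) * x_t, starting from zeros,
    so older inputs are down-weighted geometrically.
    """
    state = [0.0] * len(xs[0])
    out = []
    for x in xs:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        out.append(state)
    return out


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, budget: int) -> List[int]:
    """Simplified query-aware block selection (illustrative only).

    Scores each block of keys by query . mean(block keys) and returns the
    indices of the `budget` highest-scoring blocks in ascending order.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start // block_size))
    top = sorted(scores, reverse=True)[:budget]
    return sorted(idx for _, idx in top)
```

In a sparse-inference setting, attention would then be computed only over the keys and values in the selected blocks; how a decaying memory interacts with that selection is the question the paper studies.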
