CL, LG, May 31

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

arXiv:2606.01294
AI Analysis

For practitioners using linear attention, CCQ offers a simple, composable fix that narrows the retrieval gap with softmax attention without incurring quadratic cost.

Linear attention suffers from diluted retrieval because its read step sums contributions from all past keys additively. The authors propose Curvature-Conditioned Query (CCQ), a read-time contraction of the query based on the running key covariance. Attached to GLA and Gated DeltaNet, CCQ improves perplexity, zero-shot accuracy, S-NIAH retrieval, length-extrapolation perplexity (4K→20K), and LongBench accuracy at small extra cost.

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.
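To make the read-time idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. For n stored keys k_i with mean mu and covariance Sigma, the softmax log-partition log sum_i exp(q . k_i) expands around q = 0 (where all attention weights are uniform) as log n + mu . q + (1/2) q^T Sigma q, so its curvature is the key covariance. The abstract does not give the exact form of the contracting operator, so the sketch assumes one plausible choice, (I + lam * Sigma)^{-1}, which shrinks the query most along the directions where keys are densest. The function names, the `lam` strength parameter, and the ungated, token-by-token recurrence are all illustrative assumptions; GLA and Gated DeltaNet add gating or delta updates and would use a chunkwise form in practice.

```python
import numpy as np


def contract_query(q, cov, lam=1.0):
    """Shrink q along high-variance key directions.

    ASSUMPTION: the operator (I + lam * cov)^{-1} is an illustrative choice,
    not the form given in the paper. Since cov is positive semi-definite,
    every eigenvalue of the operator lies in (0, 1], so q is never amplified.
    """
    d = q.shape[0]
    return np.linalg.solve(np.eye(d) + lam * cov, q)


def linear_attention_ccq_sketch(Q, K, V, lam=1.0):
    """Causal, ungated linear attention with a covariance-contracted read.

    Q, K: arrays of shape (T, d). V: array of shape (T, d_v).
    `lam` is a hypothetical strength knob; lam = 0 gives plain linear attention.
    """
    T, d = K.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))     # fast-weight state: sum of k_t v_t^T
    k_sum = np.zeros(d)        # running sum of keys
    kk_sum = np.zeros((d, d))  # running sum of k_t k_t^T
    out = np.zeros((T, d_v))
    for t in range(T):
        k, v = K[t], V[t]
        # Write step: the usual additive update, plus two extra running sums.
        S += np.outer(k, v)
        k_sum += k
        kk_sum += np.outer(k, k)
        # Running key covariance from the accumulated first and second moments.
        n = t + 1
        mu = k_sum / n
        cov = kk_sum / n - np.outer(mu, mu)
        # Read step: contract the query first, then read the state as usual.
        out[t] = contract_query(Q[t], cov, lam) @ S
    return out
```

The write step is unchanged apart from the two extra running sums, which matches the claim that the covariance can be maintained with the same recurrent mechanism as the state. Only the query passed to the read differs, which is why such a mechanism can be attached to different backbones.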
