LG, CL · May 10

Nectar: Neural Estimation of Cached-Token Attention via Regression

arXiv:2605.09778
AI Analysis

For practitioners deploying large models on long-context tasks, Nectar offers a way to reduce inference cost while preserving generation quality.

Nectar replaces the O(n) attention over a long KV-cache with compact neural networks that predict the attention output, making the cost of that step independent of context length. On models from 1.7B to 8B parameters across five long-context datasets, the approximation error tracks the next-token accuracy gap to full attention, and generations match those produced with the full cache in semantic content.

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|θ|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
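The role of the two networks can be illustrated with a minimal sketch. The abstract does not spell out how the pair is combined with attention over newly generated tokens, so the sketch below assumes the standard log-sum-exp identity for merging softmax attention over two disjoint key sets: given the output and log-normalizer for each set, the joint output is their normalizer-weighted average. The names `attend`, `merge`, `target_net`, and `score_net` are hypothetical, and the two "networks" here are exact oracles that read the cache rather than fitted models; the sketch only shows why predicting an output and a log-normalizer is sufficient.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over a set of key/value pairs.

    Returns (output, lse) with lse = log sum_j exp(q . k_j / sqrt(d)).
    """
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [sum(w * v[i] for w, v in zip(weights, values)) / z
           for i in range(len(values[0]))]
    return out, m + math.log(z)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine attention results computed over two disjoint key sets."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]


random.seed(0)
d, n_cache, n_new = 8, 64, 4
cache_k = [rand_vec(d) for _ in range(n_cache)]
cache_v = [rand_vec(d) for _ in range(n_cache)]
new_k = [rand_vec(d) for _ in range(n_new)]
new_v = [rand_vec(d) for _ in range(n_new)]
q = rand_vec(d)


# Assumption: exact oracles stand in for the fitted networks.
def target_net(query):
    """Attention output over the fixed cache, as a function of the query."""
    return attend(query, cache_k, cache_v)[0]


def score_net(query):
    """Log-normalizer over the fixed cache, as a function of the query."""
    return attend(query, cache_k, cache_v)[1]


# Exact attention over the new tokens only (cost independent of n_cache).
local_out, local_lse = attend(q, new_k, new_v)
merged = merge(target_net(q), score_net(q), local_out, local_lse)

# Reference: full attention over cache plus new tokens.
full, _ = attend(q, cache_k + new_k, cache_v + new_v)
max_err = max(abs(a - b) for a, b in zip(merged, full))
```

With oracles, `merged` agrees with `full` up to floating-point rounding. When fitted networks replace the oracles, the two calls no longer touch the cache, and any error in the predicted log-normalizer shifts attention weight between the cached context and the new tokens, which is one reason the score is modelled separately from the output.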

