LG CL · May 10

Nectar: Neural Estimation of Cached-Token Attention via Regression

João Monteiro, Michal Klein, Pierre Ablin, Marco Cuturi

arXiv:2605.09778

AI Analysis

For practitioners deploying large models on long-context tasks, Nectar offers a way to reduce inference cost while preserving generation quality.

Nectar replaces the $O(n)$ attention over a long KV-cache with compact neural networks that predict the attention output and its log-normalizer, so the cost of attending to the cached context no longer depends on context length. On models from 1.7B to 8B parameters across five long-context datasets, the approximation error tracks the next-token accuracy gap to full attention, and generated answers match in semantic content those produced with the full cache.

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|θ|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
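The abstract does not spell out how the two predictions enter the attention computation. One natural reading, offered here only as an assumption, is the standard log-sum-exp merge of partial softmax results: the attention output over the cached context and its log-normalizer are enough to combine exactly with attention over the newly generated tokens. The sketch below checks that identity in plain Python. The helper names `attend` and `merge` are hypothetical, and exact attention over the cache stands in for what a target network and a score network would predict.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over (keys, values).

    Returns the attention output and the log-normalizer,
    i.e. the log-sum-exp of the scaled dot-product scores.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [
        sum(w * v[j] for w, v in zip(weights, values)) / z
        for j in range(len(values[0]))
    ]
    return out, m + math.log(z)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into attention over their union."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


rng = random.Random(0)
d, n_cache, n_new = 4, 50, 3


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


cache_k = [rand_vec() for _ in range(n_cache)]
cache_v = [rand_vec() for _ in range(n_cache)]
new_k = [rand_vec() for _ in range(n_new)]
new_v = [rand_vec() for _ in range(n_new)]
q = rand_vec()

# Reference: full attention over cached context plus new tokens.
full_out, _ = attend(q, cache_k + new_k, cache_v + new_v)

# Assumption: in Nectar these two quantities would come from the target
# and score networks; here they are computed exactly as a stand-in.
cache_out, cache_lse = attend(q, cache_k, cache_v)

# Attention over the new tokens is still computed exactly.
new_out, new_lse = attend(q, new_k, new_v)

merged = merge(cache_out, cache_lse, new_out, new_lse)
max_err = max(abs(a - b) for a, b in zip(full_out, merged))
print(f"max abs difference vs full attention: {max_err:.2e}")
```

With exact inputs the merged result agrees with full attention up to floating-point error. Under this reading, any deviation in a deployed system would come only from how well the two networks approximate `cache_out` and `cache_lse` for the queries they see, while the per-query work over the context would no longer scale with `n_cache`.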
