Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

arXiv:2605.250859.0

Predicted impact top 67% in IT · last 90 daysOriginality Highly original

AI Analysis

For practitioners deploying large language models with limited memory, the paper provides theoretical bounds and empirical validation for KV cache compression, showing that recency-based eviction policies achieve power-law distortion decay.

The paper studies the rate-distortion limits of KV cache compression in autoregressive language models, finding that context truncation sensitivity decays polynomially rather than geometrically. It characterizes the per-token memory requirement of suffix-only cache policies, showing a sliding-window scheme attains distortion ε with window w = O(ε^{-1/α}), and a converse shows w = Ω(ε^{-1/α}) is necessary, yielding Θ(ε^{-1/α}) scaling.

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$-$3$B parameters, we find that the next-token distribution's sensitivity to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding \emph{polynomial truncation-sensitivity} assumption, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/α})$, and -- under an additional two-sided Bayes-risk condition -- a converse shows $w = Ω(\varepsilon^{-1/α})$ is necessary within this policy class, so the scaling is $Θ(\varepsilon^{-1/α})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.

View on arXiv PDF

Similar