CL · Mar 2

Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

arXiv:2603.01426v12.13 citations · h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses the memory bottleneck of long-context LLMs and offers insights into compression tolerance and scalability, but the advance is incremental because it builds on existing compression methods.

The paper studies key-value (KV) cache compression in large language models (LLMs) by analyzing attention dynamics. It finds that moderate compression degrades internal representations with little accuracy loss, and that all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER).

As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
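
The distinction between per-head retention and global survival of a token can be illustrated with a small sketch. This is not the authors' method: this listing does not define GER, so the quantity below (the fraction of tokens that no head retains) is only an assumed stand-in. The per-head top-k eviction policy and the function names `keep_top_k`, `compress_heads`, and `globally_evicted_fraction` are likewise hypothetical.

```python
import random


def keep_top_k(scores, keep):
    """Return the set of token indices with the `keep` highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:keep])


def compress_heads(head_scores, compression):
    """Apply per-head top-k retention (an assumed eviction policy).

    `compression` is the fraction of tokens each head evicts.
    Each head keeps at least one token.
    """
    n_tokens = len(head_scores[0])
    keep = max(1, round(n_tokens * (1.0 - compression)))
    return [keep_top_k(scores, keep) for scores in head_scores]


def globally_evicted_fraction(kept_per_head, n_tokens):
    """Fraction of tokens retained by no head at all.

    Illustrative stand-in only; the paper's GER may be defined differently.
    """
    survivors = set().union(*kept_per_head)
    return (n_tokens - len(survivors)) / n_tokens


if __name__ == "__main__":
    random.seed(0)
    n_heads, n_tokens = 8, 200
    # Synthetic importance scores, one list per attention head.
    scores = [[random.random() for _ in range(n_tokens)] for _ in range(n_heads)]
    for compression in (0.5, 0.8, 0.9, 0.95):
        kept = compress_heads(scores, compression)
        frac = globally_evicted_fraction(kept, n_tokens)
        print(f"compression={compression:.2f} globally evicted={frac:.3f}")
```

With independent random scores, a token evicted by one head is usually kept by another, so global eviction stays low until each head keeps very few tokens. When heads agree on which tokens matter, the same per-head budget removes far more tokens globally, which is one way to picture how head-level consensus could reduce routing flexibility.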
