Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

Jiaming Yang, Chenwei Tang, Liangli Zhen, Jiancheng Lv

arXiv:2604.2597595.4

AI Analysis

For LLM practitioners dealing with long-context inference, this work provides a theoretically grounded eviction policy in place of heuristic approaches, with a reported better trade-off between memory efficiency and generation fidelity.

This paper addresses the memory bottleneck of KV caching in long-context LLM inference by proposing CapKV, a capacity-aware eviction method grounded in the Information Bottleneck principle. CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generation fidelity across multiple models and benchmarks.

Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generation fidelity.
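
The abstract does not give the exact objective or the CapKV algorithm, so the following is only a minimal sketch of the general idea it names: scoring cached keys by statistical leverage and measuring a retained subset with a log-determinant. It assumes the textbook Gaussian-channel capacity 0.5 * log det(I + K_S^T K_S / lam) as a stand-in for the paper's objective and uses standard ridge leverage scores; the function names, the regularizer `lam`, and the top-k selection rule are all hypothetical choices for illustration, not the authors' method.

```python
import numpy as np


def ridge_leverage_scores(K, lam=1e-2):
    """Ridge leverage score of each cached key (row of K).

    score_i = k_i^T (K^T K + lam * I)^{-1} k_i, which lies in [0, 1).
    `lam` is an assumed regularizer, not a value taken from the paper.
    """
    d = K.shape[1]
    gram = K.T @ K + lam * np.eye(d)
    # Solve gram @ X = K^T instead of forming an explicit inverse.
    X = np.linalg.solve(gram, K.T)          # shape (d, n)
    return np.einsum("ij,ji->i", K, X)      # row-wise k_i^T X[:, i]


def logdet_capacity(K_subset, lam=1e-2):
    """Assumed capacity proxy: 0.5 * log det(I + K_S^T K_S / lam)."""
    d = K_subset.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(d) + K_subset.T @ K_subset / lam)
    return 0.5 * logdet


def select_by_leverage(K, budget, lam=1e-2):
    """Keep the `budget` keys with the highest leverage (hypothetical rule)."""
    scores = ridge_leverage_scores(K, lam)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)                    # preserve token order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((64, 8))        # 64 cached keys, head dim 8
    keep = select_by_leverage(K, budget=16)
    print("kept indices:", keep)
    print("capacity (subset):", logdet_capacity(K[keep]))
    print("capacity (full):  ", logdet_capacity(K))
```

Under this proxy the capacity of any retained subset can never exceed that of the full cache, so the gap between the two printed values gives a rough sense of how much is lost at a given budget. A real implementation would operate per attention head and would likely also account for values and queries, which this sketch ignores.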
