GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs
This work offers an incremental improvement for LLM developers and researchers working on long-context models: it lowers KV-cache memory overhead while improving benchmark performance.
The paper addresses imbalanced merging in KV-cache compression for long-context LLMs, a problem that arises when span-based retention is combined with post-eviction merging. It proposes GRKV, a training-free method that uses ridge regression to distribute information from evicted tokens across retained tokens, improving overall performance on the LongBench and RULER benchmarks.
Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
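To make the ridge-regression idea concrete, the following is a minimal NumPy sketch and not the authors' algorithm. It assumes a single attention head, a small set of probe queries `Q`, and a given set of retained indices `keep_idx`; it then adjusts only the retained values so that compressed-cache attention outputs approximate full-cache outputs, with a penalty `lam` on the size of the update. The function names, the choice to update values but not keys, and the random demo data are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention for one head.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ridge_merge_values(Q, K, V, keep_idx, lam=1e-2):
    """Hypothetical helper: update retained values by ridge regression.

    Solves  min_D ||A (V_kept + D) - O_full||^2 + lam * ||D||^2,
    where A holds the attention weights over the retained keys only.
    """
    O_full = attention(Q, K, V)                 # target: full-cache outputs
    K_kept, V_kept = K[keep_idx], V[keep_idx]
    A = softmax(Q @ K_kept.T / np.sqrt(Q.shape[-1]))
    R = O_full - A @ V_kept                     # residual left by eviction
    G = A.T @ A + lam * np.eye(len(keep_idx))   # regularized normal equations
    delta = np.linalg.solve(G, A.T @ R)         # closed-form ridge solution
    return K_kept, V_kept + delta

# Demo on random data (assumed shapes: 32 queries, 64 cached tokens, dim 16).
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 16))
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
keep_idx = np.sort(rng.choice(64, size=16, replace=False))

O_full = attention(Q, K, V)
err_before = np.linalg.norm(attention(Q, K[keep_idx], V[keep_idx]) - O_full)
K_new, V_new = ridge_merge_values(Q, K, V, keep_idx)
err_after = np.linalg.norm(attention(Q, K_new, V_new) - O_full)
print(f"output error before merge: {err_before:.4f}, after: {err_after:.4f}")
```

Because the zero update is always a feasible solution, the ridge step cannot increase the output error on the probe queries. A larger `lam` keeps the retained values closer to their originals, which corresponds loosely to the regularization against over-smoothing described in the abstract.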