CL · Mar 23, 2025

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song

arXiv:2503.17922v22.71 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficient LLM inference for industrial applications by reducing the GPU memory consumed by the KV cache. It is an incremental advance that builds on existing KV cache compression techniques.

The paper tackles the problem of high GPU memory consumption from KV cache in long-context LLM inference by proposing WindowKV, a task-adaptive window selection method that retains only 12% of the original KV cache while maintaining performance comparable to full retention on benchmarks like LongBench and achieving state-of-the-art results in Needle-in-a-Haystack.
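To put the 12% figure in perspective, here is a rough back-of-the-envelope estimate of KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,000-token context are illustrative assumptions, not numbers taken from the paper, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for one sequence.

    The leading factor 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed (hypothetical) configuration, chosen only for illustration.
SEQ_LEN = 32_000
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

full_bytes = kv_cache_bytes(SEQ_LEN, **CONFIG)

# Retain 12% of the tokens; integer arithmetic avoids float rounding.
kept_tokens = SEQ_LEN * 12 // 100
kept_bytes = kv_cache_bytes(kept_tokens, **CONFIG)

GIB = 1024 ** 3
print(f"full cache:     {full_bytes / GIB:.2f} GiB")
print(f"12% of tokens:  {kept_bytes / GIB:.2f} GiB")
```

Under these assumptions the cache size scales linearly with the number of retained tokens, so keeping 12% of the tokens keeps 12% of the memory.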

With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring that the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
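The two ideas in the abstract, keeping windows of consecutive tokens and sharing the selected indices across layers in a group, can be illustrated with a minimal sketch. This is not the authors' implementation: the per-token importance scores, the mean-based window scoring, the fixed window size, and the function names `select_windows` and `shared_indices_per_layer` are all assumptions made for illustration, and the task-adaptive part of the method is omitted.

```python
def select_windows(scores, window_size, budget):
    """Keep the highest-scoring contiguous windows within a token budget.

    scores: per-token importance values (assumed to be given, for example
            derived from attention weights).
    Returns the retained token indices in ascending order.
    """
    n = len(scores)
    windows = []
    for start in range(0, n, window_size):
        chunk = scores[start:start + window_size]
        # Mean rather than sum, so a short final window is not penalized.
        windows.append((sum(chunk) / len(chunk), start))

    n_keep = max(1, budget // window_size)
    top = sorted(windows, key=lambda w: w[0], reverse=True)[:n_keep]

    keep = []
    for _, start in sorted(top, key=lambda w: w[1]):
        keep.extend(range(start, min(start + window_size, n)))
    return keep


def shared_indices_per_layer(layer_scores, group_size, window_size, budget):
    """Select indices once per group of layers and reuse them in the group.

    Assumption: the first layer of each group does the selection and the
    remaining layers in that group reuse its indices.
    """
    per_layer = []
    current = []
    for layer, scores in enumerate(layer_scores):
        if layer % group_size == 0:
            current = select_windows(scores, window_size, budget)
        per_layer.append(current)
    return per_layer


# Toy example: 8 tokens, windows of 2 tokens, budget of 4 tokens.
scores_a = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.6]
scores_b = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
layers = [scores_a, scores_b, scores_b, scores_a]

kept = select_windows(scores_a, window_size=2, budget=4)
shared = shared_indices_per_layer(layers, group_size=2, window_size=2, budget=4)
print(kept)
print(shared)
```

Selecting whole windows keeps neighboring tokens together, which is the semantic-coherence property the abstract emphasizes. Reusing indices within a group means the selection step runs once per group instead of once per layer.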
