CL AI · May 30, 2025

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar

Microsoft

arXiv:2505.24133v322.221 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

The work addresses a critical bottleneck for deploying reasoning models efficiently in resource-constrained environments. It is an incremental advance because it builds on existing KV cache compression approaches.

The paper tackles the large key-value (KV) caches that reasoning models accumulate during inference, which strain memory and limit throughput. It proposes R-KV, a redundancy-aware compression method that preserves nearly 100% of full-cache performance with only 10% of the KV cache and reaches 105% with 16% of the cache, while saving 90% of memory and increasing throughput by 6.6x.

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and 6.6x higher throughput than standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
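To make the memory arithmetic concrete, here is a minimal back-of-the-envelope sketch of how KV cache size scales with the number of retained tokens. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements, 32,000 generated tokens) are hypothetical values chosen for illustration, not figures from the paper. The sketch counts only the cache tensors for one sequence and does not implement R-KV's token selection.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 accounts for storing both a key and a value
    # vector per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model shape and generation length (assumptions, not from the paper).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
NUM_TOKENS = 32_000

full = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

# Keep 10% of the tokens, mirroring the 10% cache budget quoted in the abstract.
budget_tokens = NUM_TOKENS * 10 // 100
compressed = kv_cache_bytes(budget_tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

saving = 1 - compressed / full
print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"compressed cache: {compressed / 2**30:.2f} GiB")
print(f"memory saving:    {saving:.0%}")
```

Because cache size is linear in the number of retained tokens, keeping 10% of the tokens cuts the cache tensors by 90% under these assumptions. The end-to-end saving in a real deployment also depends on model weights, activations, and how the budget is applied during generation.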
