AI · Dec 8, 2025

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

arXiv:2512.07993v1 · 1 citation · h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in deploying large reasoning models in real-world applications, though it is incremental because it builds on existing KV compression methods.

The paper tackles the high key-value (KV) cache overhead that large reasoning models incur during chain-of-thought reasoning, which causes memory and throughput bottlenecks. It proposes SkipKV, a training-free method that selectively skips KV generation and storage at the sentence level. Compared to alternatives, SkipKV achieves up to 26.7% higher accuracy, up to 1.6x shorter generation length, and up to 1.7x higher throughput.
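To make the memory overhead concrete, here is a minimal sketch of the standard KV cache size estimate for a decoder-only transformer. The function name and the model dimensions in the usage note are hypothetical and are not taken from the paper; the point is only that the cache grows linearly with the number of generated tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_trace = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token // 1024, "KiB per token")
print(full_trace // 2**30, "GiB for an 8192-token reasoning trace")
```

Under these assumed dimensions each generated token adds 128 KiB, so an 8192-token chain of thought occupies 1 GiB per sequence, and the total scales again with batch size.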

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This causes both memory and throughput bottlenecks, limiting their efficient deployment. To reduce KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates at coarse-grained, sentence-level granularity for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate that SkipKV achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×.
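The idea of sentence-level eviction can be illustrated with a small sketch. The abstract does not specify the sentence-scoring metric, so the code below substitutes a simple bag-of-words cosine similarity as a stand-in; the function names and the 0.8 threshold are assumptions for illustration, not SkipKV's actual method. A sentence is kept only if it is not too similar to any sentence already kept, and in a real system the KV entries of every dropped sentence would be evicted as a unit.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_indices(sentences, threshold=0.8):
    """Return indices of sentences to keep; near-duplicates are dropped.

    threshold is a hypothetical similarity cutoff, not a value from the paper.
    """
    kept, kept_vecs = [], []
    for i, sentence in enumerate(sentences):
        vec = bag_of_words(sentence)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(i)
            kept_vecs.append(vec)
    return kept


trace = [
    "Let me check that 12 times 3 is 36.",
    "Wait, let me check that 12 times 3 is 36.",  # redundant revalidation
    "So the answer is 36.",
]
print(keep_indices(trace))
```

Here the second sentence repeats the first almost verbatim, so it is the one dropped, while the conclusion is kept. Removing whole sentences rather than individual tokens is what keeps the remaining context semantically coherent.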

