LG CL · May 9

Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models

arXiv:2605.0900855.2

AI Analysis

For LLM practitioners needing efficient inference, RKU addresses the collapse of reasoning at high sparsity, a known bottleneck in structural pruning.

The paper tackles inference latency and KV cache bottlenecks in LLMs caused by long CoT sequences, proposing Relative Kinetic Utility (RKU) for structural pruning. At 40% sparsity, RKU achieves 13.34% accuracy on GSM8K, outperforming baselines and preserving reasoning under distribution shift.

Chain-of-Thought (CoT) prompting marked a major improvement in the reasoning capabilities of Large Language Models (LLMs). However, scaling up test-time computation yields long CoT sequences, introducing severe inference latency and key-value (KV) cache memory bottlenecks. While structural pruning offers a fundamental, hardware-aware way to reduce the static parameter burden, existing magnitude-based methods may remove the neurons that CoT reasoning depends on. By over-indexing on discrete cross-entropy objectives, these heuristics fall into a "magnitude trap": they prioritize high-frequency, low-information syntactic tokens and trigger reasoning collapse at high sparsities (e.g., 40%). To overcome this topological phase transition, we propose Relative Kinetic Utility (RKU), a novel theoretical framework based on Alternating Gradient Flow (AGF) that elevates discrete pruning to a continuous kinetic integral over the depth manifold of the model. Modified with Fisher trace normalization, RKU acts as a lightweight curvature-aware normalization that isolates "kinetic spikes", the fundamental structural pathways responsible for high-curvature logical routing. Extensive experiments on Qwen-2.5-7B and LLaMA-3-8B show improved performance in the high-sparsity regime around 40%. RKU attains 13.34% accuracy on GSM8K at 40% sparsity, outperforming the strongest baseline, and appears to better preserve reasoning-relevant representations under out-of-distribution evaluation.
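
The abstract does not give the RKU formula, so the following is only an illustrative sketch of the contrast it draws, not the paper's method. It compares a plain magnitude score (the L2 norm of each neuron's weights) with a hypothetical curvature-aware stand-in: a first-order saliency |w·g| divided by the layer's empirical Fisher trace (sum of squared gradients). The function names, the toy weights and gradients, and the stand-in score itself are all assumptions made for illustration.

```python
import math

def magnitude_scores(weights):
    """Magnitude heuristic: L2 norm of each neuron's weight row."""
    return [math.sqrt(sum(w * w for w in row)) for row in weights]

def fisher_normalized_scores(weights, grads):
    """Hypothetical stand-in (NOT the paper's RKU definition):
    first-order saliency |sum(w * g)| per neuron, divided by the
    layer's empirical Fisher trace (sum of squared gradients)."""
    trace = sum(g * g for row in grads for g in row) or 1.0
    return [abs(sum(w * g for w, g in zip(w_row, g_row))) / trace
            for w_row, g_row in zip(weights, grads)]

def prune_mask(scores, sparsity):
    """Keep-mask that drops the lowest-scoring `sparsity` fraction of neurons."""
    n_drop = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(scores))]

# Toy layer: 5 neurons with 2 weights each (made-up numbers).
weights = [[3.0, 0.0], [0.1, 0.1], [2.0, 2.0], [0.2, 0.0], [1.0, 1.0]]
grads = [[0.0, 0.1], [2.0, 2.0], [0.2, 0.2], [3.0, 0.0], [0.5, 0.5]]

mask_magnitude = prune_mask(magnitude_scores(weights), 0.4)
mask_fisher = prune_mask(fisher_normalized_scores(weights, grads), 0.4)
```

On this toy layer the two criteria disagree at 40% sparsity: the magnitude score removes the two small-weight neurons even though they carry large gradients, while the gradient-aware score keeps one of them and instead removes the largest-weight neuron, whose gradient contribution is zero. This is the kind of divergence the abstract attributes to the magnitude trap.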
