LG AI · May 9

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

arXiv:2605.16360

AI Analysis

For LLM practitioners needing efficient long-context inference, ProxyKV offers a practical solution that reduces prefilling overhead without significant accuracy loss.

ProxyKV addresses the trade-off between low-latency heuristics and high-precision reconstruction in KV cache pruning for long-context LLM inference, achieving ~98.7% of KVZip's mean accuracy while delivering up to 3.21× prefilling speedup on Llama-3.1-8B and sustaining speedup at contexts up to 170k tokens on Qwen-2.5-7B.

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost–accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering ~98.7% of its mean accuracy) while delivering up to a 3.21× prefilling speedup on Llama-3.1-8B (dual-GPU; ~1.5× shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B.
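The abstract's point about ranking consistency can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the HybridAxialMapper and the Multi-Granularity Hybrid Loss are not reproduced, and the function names, the `keep_ratio` parameter, and the score values are all hypothetical. The sketch assumes pruning keeps the top-scoring fraction of token positions, and shows that a proxy whose scores are numerically far from the target's can still select the same KV entries as long as it orders positions the same way.

```python
from typing import List, Sequence


def select_kept_positions(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted token positions whose KV entries are retained."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n_keep])


def pairwise_rank_agreement(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of position pairs that both score lists order the same way."""
    agree = 0
    total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            if a[i] == a[j] or b[i] == b[j]:
                continue  # ties carry no ordering information
            total += 1
            agree += (a[i] > a[j]) == (b[i] > b[j])
    return agree / total if total else 1.0


# Hypothetical per-token importance scores for an 8-token context.
target_scores = [0.90, 0.10, 0.40, 0.05, 0.70, 0.20, 0.80, 0.30]
# Hypothetical proxy scores: a poor fit in value, but the same ordering.
proxy_scores = [5.0, 0.2, 1.1, 0.1, 2.0, 0.3, 3.0, 0.9]

kept_by_target = select_kept_positions(target_scores, keep_ratio=0.5)
kept_by_proxy = select_kept_positions(proxy_scores, keep_ratio=0.5)
agreement = pairwise_rank_agreement(target_scores, proxy_scores)
```

Here the proxy and target retain identical positions even though a regression loss between the two score lists would be large, which is one way to read the motivation for optimizing relative ranking rather than exact score values.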
