LG AI · Feb 9

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

arXiv:2602.08686v11.4h-index: 1

Originality: Highly original

AI Analysis

This work addresses memory efficiency for LLM deployment in long-context applications. It represents a strong, specific gain rather than a foundational advance.

The paper tackles KV cache memory constraints in long-context LLM inference by proposing CompilerKV, a risk-adaptive, head-aware compression framework. On LongBench under a 512-token budget, it recovers 97.7% of FullKV performance and gains up to 5.2 points over the strongest competitor.

Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory budget allocation. Under tight memory budgets, these methods overlook two key factors: prompt-dependent variation in compression risk and functional heterogeneity across attention heads. Ignoring these factors destabilizes token selection and leads to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive and head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two complementary components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to account explicitly for functional differences across attention heads; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show that CompilerKV outperforms state-of-the-art methods under a 512-token budget, recovering 97.7% of FullKV performance while achieving a gain of up to +5.2 points over the strongest competitor.
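The abstract does not state the scoring or gating formulas, so the following is only an illustrative sketch of how the two components could fit together at prefill time. It is not the authors' implementation. Every name here (`attention_entropy`, `prompt_risk`, `select_tokens`, `alpha`, `base_threshold`) is hypothetical, the head weights stand in for entries of a precompiled table, and the particular way entropy and perplexity are blended is an assumption made for the example.

```python
import math

def attention_entropy(row):
    """Shannon entropy of one head's attention distribution, scaled by log(n)."""
    n = len(row)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in row if p > 0.0)
    return h / math.log(n)

def prompt_risk(attn, token_nll, alpha=0.5):
    """Assumed risk score: blend of mean head entropy and a squashed mean NLL."""
    mean_entropy = sum(attention_entropy(r) for r in attn) / len(attn)
    mean_nll = sum(token_nll) / len(token_nll)
    ppl_term = 1.0 - math.exp(-mean_nll)  # 0 at NLL = 0, approaches 1 as NLL grows
    return alpha * mean_entropy + (1.0 - alpha) * ppl_term

def select_tokens(attn, head_weights, token_nll, budget, base_threshold=0.05):
    """Pick which prompt tokens to keep in the KV cache.

    attn[h][t]      attention mass head h places on token t (each row sums to 1)
    head_weights[h] per-head reliability weight (stand-in for a table lookup)
    token_nll[t]    per-token negative log-likelihood (local perplexity proxy)
    """
    n_tokens = len(attn[0])
    total_w = sum(head_weights)
    # Head-aware score: reliability-weighted average of attention across heads.
    scores = [
        sum(w * row[t] for w, row in zip(head_weights, attn)) / total_w
        for t in range(n_tokens)
    ]
    # Risk-adaptive gate: riskier prompts get a lower retention threshold,
    # so more tokens survive the cut (assumed direction of the adjustment).
    risk = prompt_risk(attn, token_nll)
    threshold = base_threshold * (1.0 - risk)
    kept = [t for t in range(n_tokens) if scores[t] >= threshold]
    # Enforce the hard memory budget by keeping the highest-scoring tokens.
    kept.sort(key=lambda t: scores[t], reverse=True)
    return sorted(kept[:budget]), risk

# Toy example: head 0 is sharp and trusted, head 1 is diffuse and down-weighted.
attn = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
head_weights = [0.9, 0.1]
token_nll = [0.2, 0.3, 2.0, 0.1]
kept, risk = select_tokens(attn, head_weights, token_nll, budget=2)
```

In this toy setup the trusted head concentrates on token 0, so that token is among those kept whenever the budget is at least 1. In a real deployment the weights and thresholds would come from the offline-compiled tables the abstract describes rather than from hand-set constants.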
