SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching
This work addresses memory constraints in multi-turn LLM deployment, proposing a novel method for a known inference bottleneck.
The paper tackles the linear growth of the Key-Value (KV) cache in multi-turn LLM deployment by proposing SONIC, a learning-based framework that compresses historical segments into compact Nexus tokens. It achieves a 35.55% average score improvement on MTBench101 over state-of-the-art baselines and accelerates inference by 50.1% compared to full-context generation.
The linear growth of the Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose \textbf{SONIC}, a learning-based framework that compresses historical segments into compact and semantically rich \textbf{Nexus} tokens. By integrating dynamic budget training, SONIC adapts flexibly to varying memory constraints without retraining. Experiments show that at compression ratios of 80\% and 50\%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55\% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC improves deployment efficiency, accelerating the overall inference process by 50.1\% compared to full-context generation.
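To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not the SONIC implementation. It uses the standard estimate that each cached token stores one key and one value vector per layer and per KV head. The model dimensions, the helper names, and the bookkeeping rule (each historical turn replaced by a fixed number of summary tokens, with the current turn kept in full) are all assumptions made for illustration; the abstract does not specify how Nexus tokens are allocated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16 storage


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer and per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def full_cache_tokens(turn_lengths: list[int]) -> int:
    # Without compression, every token of every turn stays in the cache,
    # so the cache grows linearly with the dialogue length.
    return sum(turn_lengths)


def compressed_cache_tokens(turn_lengths: list[int],
                            nexus_per_segment: int) -> int:
    # Assumption: each historical turn is replaced by at most
    # `nexus_per_segment` summary tokens; the current turn is kept in full.
    if not turn_lengths:
        return 0
    *history, current = turn_lengths
    return sum(min(n, nexus_per_segment) for n in history) + current


if __name__ == "__main__":
    shape = ModelShape()
    turns = [200, 200, 200, 200]  # hypothetical token counts per turn
    per_token = kv_bytes_per_token(shape)
    for label, tokens in [
        ("full", full_cache_tokens(turns)),
        ("compressed", compressed_cache_tokens(turns, nexus_per_segment=40)),
    ]:
        print(f"{label}: {tokens} tokens, {tokens * per_token / 2**20:.1f} MiB")
```

Under these assumed dimensions each cached token costs 128 KiB, so the four-turn dialogue occupies 800 tokens (100 MiB) uncompressed and 320 tokens (40 MiB) with 40 summary tokens per historical turn. The per-segment budget is the knob that a dynamic-budget scheme would vary at inference time.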