LG · AI · Apr 17

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

arXiv:2604.13226 · 27.02 citations · h-index: 19
Predicted impact: top 25% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For LLM inference, KV Packet eliminates the computational overhead of context-dependent KV recomputation, enabling cache reuse with F1 scores comparable to full recomputation.

KV Packet introduces a recomputation-free KV caching framework that uses trainable soft-token adapters to reuse cached documents across contexts, achieving near-zero FLOPs and lower TTFT than recomputation-based baselines while retaining F1 scores comparable to full recomputation on Llama-3.1 and Qwen2.5.

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
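
The packet idea in the abstract can be illustrated with a toy, single-head sketch. It is not the authors' implementation: the names `KVPacket`, `SoftAdapter`, `assemble`, and `attend` are hypothetical, the adapter layout (soft-token KV entries before and after each document) is an assumption read from the word "wrapped", and positional encoding and the distillation training of the adapters are omitted entirely. The sketch only shows the property the abstract emphasizes: at reuse time the cached document entries are concatenated as-is, so no document token is recomputed.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass(frozen=True)
class KVPacket:
    """Immutable cached keys/values for one document, computed once in isolation."""
    keys: Tuple[Tuple[float, ...], ...]
    values: Tuple[Tuple[float, ...], ...]


@dataclass
class SoftAdapter:
    """Hypothetical soft-token KV entries placed around each packet.

    In the paper these would be trained; here they are fixed placeholder numbers.
    """
    head_keys: List[Vector]
    head_values: List[Vector]
    tail_keys: List[Vector]
    tail_values: List[Vector]


def assemble(packets: List[KVPacket], adapter: SoftAdapter) -> Tuple[List[Vector], List[Vector]]:
    """Concatenate wrapped packets; cached document entries are copied, never recomputed."""
    keys: List[Vector] = []
    values: List[Vector] = []
    for packet in packets:
        keys += adapter.head_keys + [list(k) for k in packet.keys] + adapter.tail_keys
        values += adapter.head_values + [list(v) for v in packet.values] + adapter.tail_values
    return keys, values


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


# Two documents cached independently, then reused together in a new context.
doc_a = KVPacket(keys=((1.0, 0.0), (0.0, 1.0)), values=((1.0, 2.0), (3.0, 4.0)))
doc_b = KVPacket(keys=((0.5, 0.5),), values=((5.0, 6.0),))
adapter = SoftAdapter(
    head_keys=[[0.1, 0.1]], head_values=[[0.0, 0.0]],
    tail_keys=[[-0.1, 0.2]], tail_values=[[0.0, 0.0]],
)
keys, values = assemble([doc_a, doc_b], adapter)
output = attend([1.0, 1.0], keys, values)
```

The only per-request work in `assemble` is list concatenation, which is the sense in which reuse costs near-zero FLOPs; whether the resulting attention output matches full recomputation depends on how well the adapters are trained, which this sketch does not model.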

