CL · Mar 22

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

arXiv:2604.03270 · 12.5 · h-index: 1

Predicted impact: top 86% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For LLM practitioners, this offers a training-free method to reduce inference costs and add steerability, though the KV cache equivalence is fragile to formatting errors.

Knowledge Packs eliminate token costs in RAG by injecting pre-computed KV caches, achieving up to 95% token savings with zero output divergence across 700 questions on Qwen3-8B and Llama-3.1-8B, while also enabling behavioral steering via contrastive value deltas.

RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce at the positions of F; this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes a degradation of 6-7 percentage points, which we believe explains prior claims of KV outperforming RAG. With correct formatting, we observe zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, with up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot provide. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior, whereas key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos ~ 0) and compose, and both channels, knowledge and steering, run simultaneously at alpha <= 0.7 without interference. No training, no weight modification.
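The causal-mask argument in the abstract can be checked on a toy model. The sketch below is not the paper's code. It assumes a hypothetical attention-only transformer with random weights, a residual connection, and no positional encoding, and every name in it (`forward`, `causal_attention`, `facts`, `query`) is illustrative. It computes the per-layer keys and values for a prefix F alone, then for F followed by q, and compares the two caches at the positions of F.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_attention(q, k, v):
    """Single-head attention in which position i attends only to positions <= i."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out


def forward(x, layers):
    """Run the toy model on embeddings x and return the per-layer (K, V) cache."""
    cache = []
    h = x
    for wq, wk, wv in layers:
        q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
        cache.append((k, v))
        attn = causal_attention(q, k, v)
        # Residual connection: the next layer sees the input plus the attention output.
        h = [[a + b for a, b in zip(h_row, a_row)] for h_row, a_row in zip(h, attn)]
    return cache


rng = random.Random(0)
dim, n_layers = 4, 3


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


layers = [tuple(rand_matrix(dim, dim) for _ in range(3)) for _ in range(n_layers)]
facts = rand_matrix(5, dim)  # stand-in embeddings for the knowledge text F
query = rand_matrix(3, dim)  # stand-in embeddings for the question q

cache_f = forward(facts, layers)              # pre-computed "pack" for F alone
cache_joint = forward(facts + query, layers)  # joint pass over F followed by q

# zip() stops at the shorter matrix, so only the first len(F) rows are compared.
max_diff = max(
    abs(a - b)
    for (k_f, v_f), (k_j, v_j) in zip(cache_f, cache_joint)
    for mat_f, mat_j in ((k_f, k_j), (v_f, v_j))
    for row_f, row_j in zip(mat_f, mat_j)
    for a, b in zip(row_f, row_j)
)
print("max |difference| over the prefix positions:", max_diff)
```

In this toy setting the prefix rows match at every layer, because no position ever reads from a later one. The sketch does not model tokenization or chat templates, which is where the abstract locates the fragility: if the tokens in front of q differ between the cached pass and the joint pass, the prefixes are no longer the same text and the argument does not apply.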

