TOVA
Transformers are Multi-State RNNs
KV-cache compression · first seen Jan 11, 2024
superseded — cited as a baseline and beaten by newer methods
6 papers critique it · 14 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites TOVA as a baseline.
“these methods require access to the full attention matrix, making them incompatible with Flash Attention~flashattention and thus impractical for modern deployment scenarios”
— Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
“They fix the budget of KV Cache in a finite level, but don't distinguish the differences between layers and between heads.”
— LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
“These methods, however, often overlook the structure of key information distribution by naively evicting tokens across the entire sequence.”
— TreeKV: Smooth Key-Value Cache Compression with Tree Structures
“While effective, most methods either discard unused tokens too early or require full cache for scoring.”
— PiKV: KV Cache Management System for Mixture of Experts
“However, these methods rely primarily on attention weights and often overlook the contribution of value states in shaping the final model outputs.”
— OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
“TOVA~oren2024tova retains attention sinks and a sliding window of recent tokens; a credential at relative depth 0.5 sits 2,000 tokens outside the window.”
— Transactional Attention: Semantic Sponsorship for KV-Cache Retention
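Several of the critiques above target the same mechanism, so a small illustration may help. The following is a minimal sketch, not the authors' implementation. It assumes an eviction rule of the kind attributed to TOVA: once the cache exceeds a fixed budget, drop the cached token that receives the lowest attention from the current query, averaged over heads. The function name `evict_lowest_attention` and its parameters are hypothetical, and the rule itself is an assumption that this page does not state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evict_lowest_attention(cache_positions, per_head_logits, budget):
    """Return the cache positions kept after one decoding step.

    cache_positions: token positions currently cached, newest token included.
    per_head_logits: one list of raw attention logits per head, each aligned
        with cache_positions (hypothetical input layout).
    budget: maximum number of cached tokens; a single fixed number, which is
        the property the LAVa quote objects to.
    """
    if len(cache_positions) <= budget:
        return list(cache_positions)

    # The score uses attention weights only; value states never enter it,
    # which is the property the OBCache quote objects to.
    weights = [softmax(head) for head in per_head_logits]
    n_heads = len(weights)
    mean_weight = [
        sum(w[i] for w in weights) / n_heads
        for i in range(len(cache_positions))
    ]

    # Drop the lowest-scoring tokens until the cache fits the budget.
    n_drop = len(cache_positions) - budget
    ranked = sorted(range(len(mean_weight)), key=lambda i: mean_weight[i])
    dropped = set(ranked[:n_drop])
    return [p for i, p in enumerate(cache_positions) if i not in dropped]


kept = evict_lowest_attention(
    cache_positions=[0, 1, 2, 3],
    per_head_logits=[[2.0, 0.1, 1.0, 3.0], [1.5, 0.2, 0.5, 2.5]],
    budget=3,
)
print(kept)
```

In this toy call, position 1 has the lowest logit in both heads, so it is the one evicted. The sketch also needs the full row of attention weights for the current query, which is the dependency the Expected Attention quote raises.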
Beaten on benchmarks
Head-to-head results where a newer method reports beating TOVA. Values are copied from the source paper's tables — verify against the cited paper.
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
EA (ours) beats TOVA · score [Qwen, Ruler 4K, 50% compression]
94.7 vs 77.6
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
EA (ours) beats TOVA · score [Gemma, Ruler 4K, 50% compression]
92.7 vs 76.5
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
EA (ours) beats TOVA · score [Qwen, Ruler 16K, 50% compression]
92.7 vs 76.2
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
EA (ours) beats TOVA · score [Gemma, Ruler 16K, 50% compression]
76.6 vs 62.5
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Expected Attention beats TOVA · score [Qwen, Longbench, 25% compression]
50.25 vs 48.14
- KV Cache Transform Coding for Compact Storage in LLM Inference
KVTC beats TOVA · LITM [Llama 3.1 8B]
99.3 vs 1.2
- KV Cache Transform Coding for Compact Storage in LLM Inference
KVTC beats TOVA · LITM [MN-Minitron 8B]
99.3 vs 0.3
- KV Cache Transform Coding for Compact Storage in LLM Inference
KVTC beats TOVA · LITM [Mistral NeMo 12B]
99.8 vs 8.7
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
AhaKV beats TOVA · Average [LLaMA3-8B-Inst]
41.63 vs 40.18
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
AhaKV beats TOVA · Average [Qwen2-7B-Inst]
41.84 vs 37.99
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
AhaKV beats TOVA · Average [LLAMA2-7B-Chat]
26.78 vs 24.66
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
AhaKV beats TOVA · Average [Gemma-7B-Inst]
33.08 vs 30.80
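The reported gaps vary widely, and a quick calculation makes that visible. The sketch below only subtracts the paired values listed above; the tuple layout and the name `margin` are illustrative choices, and the rows use different benchmarks and metrics, so the margins are not comparable across papers.

```python
# (newer method, setting, newer score, TOVA score), copied from the list above.
results = [
    ("Expected Attention", "Qwen, Ruler 4K, 50% compression", 94.7, 77.6),
    ("Expected Attention", "Gemma, Ruler 4K, 50% compression", 92.7, 76.5),
    ("Expected Attention", "Qwen, Ruler 16K, 50% compression", 92.7, 76.2),
    ("Expected Attention", "Gemma, Ruler 16K, 50% compression", 76.6, 62.5),
    ("Expected Attention", "Qwen, Longbench, 25% compression", 50.25, 48.14),
    ("KVTC", "LITM, Llama 3.1 8B", 99.3, 1.2),
    ("KVTC", "LITM, MN-Minitron 8B", 99.3, 0.3),
    ("KVTC", "LITM, Mistral NeMo 12B", 99.8, 8.7),
    ("AhaKV", "Average, LLaMA3-8B-Inst", 41.63, 40.18),
    ("AhaKV", "Average, Qwen2-7B-Inst", 41.84, 37.99),
    ("AhaKV", "Average, LLAMA2-7B-Chat", 26.78, 24.66),
    ("AhaKV", "Average, Gemma-7B-Inst", 33.08, 30.80),
]


def margin(newer_score, tova_score):
    """Absolute gap between the newer method and TOVA, in reported units."""
    return round(newer_score - tova_score, 2)


margins = [(name, setting, margin(new, old)) for name, setting, new, old in results]
smallest = min(margins, key=lambda row: row[2])
largest = max(margins, key=lambda row: row[2])

for name, setting, gap in margins:
    print(f"{name:20s} {setting:36s} +{gap}")
```

Here `smallest` is the AhaKV result on LLaMA3-8B-Inst and `largest` is the KVTC result on MN-Minitron 8B, which shows how much the size of the reported win depends on the benchmark.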
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- STaR-KV · STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · Jun 1, 2026
- May 29, 2026
- May 28, 2026
- May 26, 2026
- May 25, 2026
- CONF-KV · CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM · May 24, 2026
- May 21, 2026
- May 12, 2026
- Global Retention-Based KV Eviction · Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction · May 10, 2026
- ReST-KV · ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing · May 9, 2026
- May 8, 2026
- fixed-contract diagnostic · When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression · May 7, 2026