InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
KV-cache compression · first seen Jun 28, 2024
superseded — cited as a baseline and beaten by newer methods
6 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites InfiniGen as a baseline.
“AGX+InfiniGen and AGX+InfiniGenP are even slower than the FlexGen baseline due to fine-grained, token-level selection introducing significant preprocessing overhead.”
— V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
“this approach incurs significant overhead due to the latency of fetching the selected sparse KV pairs from the CPU during decoding”
— ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
“However, the estimation time increases significantly as the sequence grows, and the inference time for a single layer is insufficient to cover this.”
— AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
“Prefetching approaches, including FlexGen and InfiniGen, help alleviate PCIe data transfer latency but face limitations in handling large-scale tasks or entail performance trade-offs.”
— CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
“Although this approach mitigates the GPU memory capacity constraint, it introduces a new I/O bottleneck. We observe that in InfiniGen, even with prefetching, slow I/O causes the GPU to stall for 61% of the end-to-end execution time, leading to a substantial performance degradation.”
— ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
“InfiniGen's recall latency cannot be fully hidden due to its inefficient token-wise recall.”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
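To put the 61% stall figure quoted from ScoutAttention in perspective, the short sketch below computes an Amdahl-style upper bound on end-to-end speedup. It assumes the stall time could be hidden completely while all other work stays unchanged, which is an idealization rather than a claim from any of the cited papers. The function name is hypothetical.

```python
def max_speedup_from_removing_stalls(stall_fraction: float) -> float:
    """Upper bound on end-to-end speedup if all stall time were hidden.

    Assumption: the non-stalled share of execution time is unchanged.
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    return 1.0 / (1.0 - stall_fraction)


# 0.61 is the GPU stall share quoted above for InfiniGen.
bound = max_speedup_from_removing_stalls(0.61)
print(f"{bound:.2f}x")  # prints "2.56x"
```

Under that assumption, hiding the I/O stalls alone would bound the gain at roughly 2.56x. Larger reported gains would have to come from other changes as well.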
Beaten on benchmarks
Head-to-head results where a newer method reports beating InfiniGen. Values are copied from the source paper's tables — verify against the cited paper.
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (RULER) [Llama-3-8B-1M]
86.88 vs 70.13
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [Llama-3-8B-1M]
39.94 vs 31.81
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (RULER) [GLM-4-9B-1M]
85.62 vs 67.60
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [GLM-4-9B-1M]
47.89 vs 41.64
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (RULER) [Llama-3.1-8B]
83.57 vs 59.27
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [Llama-3.1-8B]
48.13 vs 44.25
- CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
LiteCache + HATA beats InfiniGen · decode latency, lower is better [Qwen2.5-14B, seq=128K, bsz=1]
60.78 vs 277.09
- CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
LiteCache + HATA beats InfiniGen · decode latency, lower is better [Llama3-8B, seq=128K, bsz=1]
34.91 vs 161.09
- FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · Overall [Llama-3.1-8B-Instruct, LongBench v2]
29.22 vs 28.56
- FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · CR [Llama-3.1-8B-Instruct, LongGenBench]
78.03 vs 76.68
- FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · Overall [Qwen-2.5-7B-Instruct, LongBench v2]
26.84 vs 26.44
- FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · CR [Qwen-2.5-7B-Instruct, LongGenBench]
76.93 vs 72.96
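The head-to-head values above are easier to compare as margins. The sketch below copies the listed numbers and prints, for each row, either the score gap in points or the latency ratio. It assumes that the RULER, LongBench, LongBench v2 and LongGenBench CR scores are higher-is-better and that decode latency is lower-is-better. The `Row` type and `margin` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    paper: str
    metric: str
    setting: str
    newer: float            # value reported for the newer method
    infinigen: float        # value reported for InfiniGen
    higher_is_better: bool  # assumption: True for scores, False for latency


# Values copied from the list above; verify against the cited papers.
ROWS = [
    Row("ShadowKV", "Avg (RULER)", "Llama-3-8B-1M", 86.88, 70.13, True),
    Row("ShadowKV", "Avg (LongBench)", "Llama-3-8B-1M", 39.94, 31.81, True),
    Row("ShadowKV", "Avg (RULER)", "GLM-4-9B-1M", 85.62, 67.60, True),
    Row("ShadowKV", "Avg (LongBench)", "GLM-4-9B-1M", 47.89, 41.64, True),
    Row("ShadowKV", "Avg (RULER)", "Llama-3.1-8B", 83.57, 59.27, True),
    Row("ShadowKV", "Avg (LongBench)", "Llama-3.1-8B", 48.13, 44.25, True),
    Row("CLO", "decode latency", "Qwen2.5-14B, seq=128K, bsz=1", 60.78, 277.09, False),
    Row("CLO", "decode latency", "Llama3-8B, seq=128K, bsz=1", 34.91, 161.09, False),
    Row("FreeKV", "Overall (LongBench v2)", "Llama-3.1-8B-Instruct", 29.22, 28.56, True),
    Row("FreeKV", "CR (LongGenBench)", "Llama-3.1-8B-Instruct", 78.03, 76.68, True),
    Row("FreeKV", "Overall (LongBench v2)", "Qwen-2.5-7B-Instruct", 26.84, 26.44, True),
    Row("FreeKV", "CR (LongGenBench)", "Qwen-2.5-7B-Instruct", 76.93, 72.96, True),
]


def margin(row: Row) -> str:
    """Describe how far the newer method is ahead of InfiniGen."""
    if row.higher_is_better:
        # Score metrics: absolute gap in points.
        return f"+{row.newer - row.infinigen:.2f} points"
    # Latency: how many times lower the newer method's value is.
    return f"{row.infinigen / row.newer:.2f}x lower"


for row in ROWS:
    print(f"{row.paper:9s} {row.metric:24s} [{row.setting}]: {margin(row)}")
```

Read this way, the ShadowKV and CLO rows show large gaps, while the FreeKV LongBench v2 rows are within one point of InfiniGen.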
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 28, 2026
- May 18, 2026
- Louver · Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · May 7, 2026
- Apr 12, 2026
- ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference · Mar 28, 2026
- DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference · Feb 3, 2026
- HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference · Jan 20, 2026
- Dec 11, 2025
- CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design · Nov 18, 2025
- Oct 13, 2025