Is InfiniGen superseded?

InfiniGen (KV-cache compression) is superseded: newer methods cite it as a baseline and report beating it. Six papers critique it and four beat it on benchmarks, which ranks it #15 of 234 most-superseded. Sub-problem: the cluster led by Quest. Newer alternatives in the same sub-problem include ParisKV, KVDrive, Louver, IceCache, and ScoutAttention.


Superseded baseline · #15 of 234 most-superseded

InfiniGen

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

KV-cache compression · first seen Jun 28, 2024

superseded — cited as a baseline and beaten by newer methods

6 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites InfiniGen as a baseline.

“AGX+InfiniGen and AGX+InfiniGenP are even slower than the FlexGen baseline due to fine-grained, token-level selection introducing significant preprocessing overhead.”
— V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
“this approach incurs significant overhead due to the latency of fetching the selected sparse KV pairs from the CPU during decoding”
— ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
“However, the estimation time increases significantly as the sequence grows, and the inference time for a single layer is insufficient to cover this.”
— AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
“Prefetching approaches, including FlexGen and InfiniGen, help alleviate PCIe data transfer latency but face limitations in handling large-scale tasks or entail performance trade-offs.”
— CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
“Although this approach mitigates the GPU memory capacity constraint, it introduces a new I/O bottleneck. We observe that in InfiniGen, even with prefetching, slow I/O causes the GPU to stall for 61% of the end-to-end execution time, leading to a substantial performance degradation.”
— ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
“InfiniGen's recall latency cannot be fully hidden due to its inefficient token-wise recall.”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Beaten on benchmarks

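Head-to-head results in which a newer method reports beating InfiniGen. Each pair gives the newer method's value first and InfiniGen's second; higher is better for the benchmark scores and lower is better for decode latency (units are not stated here). Values are copied from the source papers' tables, so verify them against the cited paper.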

Sys beats InfiniGen · Avg (RULER) [Llama-3-8B-1M] · 86.88 vs 70.13 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [Llama-3-8B-1M] · 39.94 vs 31.81 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (RULER) [GLM-4-9B-1M] · 85.62 vs 67.60 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [GLM-4-9B-1M] · 47.89 vs 41.64 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (RULER) [Llama-3.1-8B] · 83.57 vs 59.27 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats InfiniGen · Avg (LongBench) [Llama-3.1-8B] · 48.13 vs 44.25 · ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
LiteCache + HATA beats InfiniGen · decode latency [Qwen2.5-14B, seq=128K, bsz=1] · 60.78 vs 277.09 · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
LiteCache + HATA beats InfiniGen · decode latency [Llama3-8B, seq=128K, bsz=1] · 34.91 vs 161.09 · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
FreeKV beats InfiniGen · Overall [Llama-3.1-8B-Instruct, LongBench v2] · 29.22 vs 28.56 · FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · CR [Llama-3.1-8B-Instruct, LongGenBench] · 78.03 vs 76.68 · FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · Overall [Qwen-2.5-7B-Instruct, LongBench v2] · 26.84 vs 26.44 · FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
FreeKV beats InfiniGen · CR [Qwen-2.5-7B-Instruct, LongGenBench] · 76.93 vs 72.96 · FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
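
The size of each gap is easier to compare once the pairs above are reduced to a single number. The following is a minimal sketch, not part of the source papers or the knowledge base: the `Result` class, the `point_gain` and `latency_ratio` helpers, and the `higher_is_better` flag are hypothetical names introduced here. It assumes the benchmark scores are higher-is-better and decode latency is lower-is-better; the latency unit is not stated in the listing, so only a ratio is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    method: str             # label exactly as it appears in the listing
    metric: str
    setting: str
    newer: float            # value reported for the newer method
    infinigen: float        # value reported for InfiniGen
    higher_is_better: bool  # assumption: True for scores, False for latency


# Values copied from the listing above; verify against the cited papers.
RESULTS = [
    Result("Sys", "Avg (RULER)", "Llama-3-8B-1M", 86.88, 70.13, True),
    Result("Sys", "Avg (LongBench)", "Llama-3-8B-1M", 39.94, 31.81, True),
    Result("Sys", "Avg (RULER)", "GLM-4-9B-1M", 85.62, 67.60, True),
    Result("Sys", "Avg (LongBench)", "GLM-4-9B-1M", 47.89, 41.64, True),
    Result("Sys", "Avg (RULER)", "Llama-3.1-8B", 83.57, 59.27, True),
    Result("Sys", "Avg (LongBench)", "Llama-3.1-8B", 48.13, 44.25, True),
    Result("LiteCache + HATA", "decode latency",
           "Qwen2.5-14B, seq=128K, bsz=1", 60.78, 277.09, False),
    Result("LiteCache + HATA", "decode latency",
           "Llama3-8B, seq=128K, bsz=1", 34.91, 161.09, False),
    Result("FreeKV", "Overall",
           "Llama-3.1-8B-Instruct, LongBench v2", 29.22, 28.56, True),
    Result("FreeKV", "CR",
           "Llama-3.1-8B-Instruct, LongGenBench", 78.03, 76.68, True),
    Result("FreeKV", "Overall",
           "Qwen-2.5-7B-Instruct, LongBench v2", 26.84, 26.44, True),
    Result("FreeKV", "CR",
           "Qwen-2.5-7B-Instruct, LongGenBench", 76.93, 72.96, True),
]


def point_gain(newer: float, infinigen: float) -> float:
    """Absolute score difference (newer minus InfiniGen)."""
    return newer - infinigen


def latency_ratio(newer: float, infinigen: float) -> float:
    """How many times lower the newer method's latency is."""
    return infinigen / newer


def describe(r: Result) -> str:
    """One-line summary of a head-to-head result."""
    label = f"{r.method} | {r.metric} [{r.setting}]"
    if r.higher_is_better:
        return f"{label}: +{point_gain(r.newer, r.infinigen):.2f} points"
    return f"{label}: {latency_ratio(r.newer, r.infinigen):.2f}x lower latency"


for result in RESULTS:
    print(describe(result))
```

On the listed figures, the score gaps range from under one point (FreeKV on LongBench v2) to more than 24 points (RULER on Llama-3.1-8B), and both decode-latency rows work out to roughly a 4.6x reduction.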

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.