InfLLM
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
KV-cache compression · first seen Feb 7, 2024
superseded — cited as a baseline and beaten by newer methods
5 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites InfLLM as a baseline.
“However, due to its sub-optimal block-level selection, it results in lower performance on most tasks compared to TokenSelect, even though we set a larger token budget for InfLLM.”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“They require careful hyperparameter tuning (e.g., chunk size in InfLLM [xiao2024infllm], or ANN index construction in RetrievalAttention [liu2024retrievalattention]) and must retain the full KV cache as a candidate pool, limiting memory savings.”
— LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
“this increases latency because of the newly introduced retrieval overhead which was not present in legacy methods”
— More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
“Though the block-level space-continuity assumption improves efficiency, it does not align with real scenarios where relevant tokens are distributed discretely, leading to a significant drop in model quality.”
— PQCache: Product Quantization-based KVCache for Long Context LLM Inference
“Although CPU offloading mitigates GPU memory limitations, existing approaches [xiao2024infllm, zhang2024pqcache] still require retrieving a substantial portion of tokens (around 20%), introducing significant decoding latency overheads due to slow data transfer between CPU RAM and GPU RAM.”
— TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
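Several of the critiques above turn on the same point: selecting whole contiguous blocks can waste the budget when the relevant tokens are scattered. The toy sketch below illustrates that argument only. It is not InfLLM's or TokenSelect's actual algorithm; the scores, the block size, the mean-score block ranking, and all function names are assumptions made up for the illustration.

```python
# Toy illustration of block-level vs token-level KV selection.
# Hypothetical helpers; not the selection logic of any cited paper.

def token_level_select(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:budget])

def block_level_select(scores, budget, block_size):
    """Pick whole contiguous blocks, ranked by mean score, while they fit the budget."""
    blocks = [list(range(start, min(start + block_size, len(scores))))
              for start in range(0, len(scores), block_size)]
    ranked = sorted(blocks,
                    key=lambda b: sum(scores[i] for i in b) / len(b),
                    reverse=True)
    chosen = set()
    for block in ranked:
        if len(chosen) + len(block) > budget:
            break  # the next block would exceed the token budget
        chosen.update(block)
    return chosen

def recall(selected, relevant):
    """Fraction of relevant tokens that ended up in the selection."""
    return len(selected & relevant) / len(relevant)

# 16 tokens; 4 relevant tokens scattered so that each block of 4 holds one.
scores = [0.1] * 16
relevant = {1, 6, 8, 15}
for i in relevant:
    scores[i] = 0.9

budget = 4
token_sel = token_level_select(scores, budget)
block_sel = block_level_select(scores, budget, block_size=4)

print("token-level recall:", recall(token_sel, relevant))
print("block-level recall:", recall(block_sel, relevant))
```

With the same budget of 4 tokens, the token-level selection covers all four relevant positions, while any single block of 4 can cover only one of them. Real attention scores are far less clean, so the size of the gap in practice is an empirical question that the cited papers address.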
Beaten on benchmarks
Head-to-head results where a newer method reports beating InfLLM. Values are copied from the source paper's tables — verify against the cited paper.
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Qwen2-7B]
49.08 vs 32.06
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Llama-3-8B]
43.90 vs 39.23
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Yi-1.5-6B]
36.77 vs 33.76
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Average [Qwen2-7B]
43.64 vs 42.90
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Average [Yi-1.5-6B]
36.02 vs 33.25
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Qwen2-7B (4K+4K)]
75.17 vs 29.82
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Llama-3-8B (4K+4K)]
66.63 vs 42.50
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Yi-1.5-6B (2K+512)]
48.93 vs 31.83
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
SAGE-KV beats InfLLM · Average [Llama3.1-8B-Instruct (128k)]
52.49 vs 50.29
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
SAGE-KV beats InfLLM · Average [Llama-3-8B-ProLong-512k-Instruct]
47.64 vs 43.08
- LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
SAGE-KV beats InfLLM · Average [Qwen2.5-7B-Instruct (128k)]
51.19 vs 47.46
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver (offloaded) beats InfLLM · Avg F1 [KV offloading, 15% budget, LongBench, Llama-3.1-8B-Instruct]
38.9 vs 26.2
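To make the size of each reported gap easier to read, the figures listed above can be turned into absolute and relative margins. The sketch below only does that arithmetic on the values as listed; the `results` layout and the `margin` helper are illustrative names, and margins from different benchmarks, models, and budgets are not directly comparable with one another.

```python
# (newer method, setting, newer score, InfLLM score), copied from the entries above.
results = [
    ("TokenSelect", "Avg. [Qwen2-7B]", 49.08, 32.06),
    ("TokenSelect", "Avg. [Llama-3-8B]", 43.90, 39.23),
    ("TokenSelect", "Avg. [Yi-1.5-6B]", 36.77, 33.76),
    ("TokenSelect", "Average [Qwen2-7B]", 43.64, 42.90),
    ("TokenSelect", "Average [Yi-1.5-6B]", 36.02, 33.25),
    ("TokenSelect", "Avg. [Qwen2-7B (4K+4K)]", 75.17, 29.82),
    ("TokenSelect", "Avg. [Llama-3-8B (4K+4K)]", 66.63, 42.50),
    ("TokenSelect", "Avg. [Yi-1.5-6B (2K+512)]", 48.93, 31.83),
    ("SAGE-KV", "Average [Llama3.1-8B-Instruct (128k)]", 52.49, 50.29),
    ("SAGE-KV", "Average [Llama-3-8B-ProLong-512k-Instruct]", 47.64, 43.08),
    ("SAGE-KV", "Average [Qwen2.5-7B-Instruct (128k)]", 51.19, 47.46),
    ("Louver (offloaded)", "Avg F1 [KV offloading, 15% budget, LongBench, Llama-3.1-8B-Instruct]", 38.9, 26.2),
]

def margin(newer, infllm):
    """Return (absolute gap in points, gap relative to InfLLM's score in percent)."""
    gap = newer - infllm
    return round(gap, 2), round(100 * gap / infllm, 1)

for method, setting, newer, infllm in results:
    absolute, relative = margin(newer, infllm)
    print(f"{method} · {setting}: +{absolute} points ({relative}% over InfLLM)")

# Smallest and largest absolute gaps among the listed entries.
smallest = min(results, key=lambda r: r[2] - r[3])
largest = max(results, key=lambda r: r[2] - r[3])
print("smallest gap:", smallest[1], margin(smallest[2], smallest[3])[0])
print("largest gap:", largest[1], margin(largest[2], largest[3])[0])
```

The spread is wide: some entries differ by under one point while others differ by tens of points, which is worth keeping in mind when reading the "beats" labels.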
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 28, 2026
- May 18, 2026
- Louver · Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · May 7, 2026
- Apr 12, 2026
- ScoutAttention · ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference · Mar 28, 2026
- DynSplit-KV · DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference · Feb 3, 2026
- HeteroCache · HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference · Jan 20, 2026
- Dec 11, 2025
- CLO · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design · Nov 18, 2025
- Oct 13, 2025