Is InfLLM superseded?

InfLLM (KV-cache compression): superseded. It is cited as a baseline and beaten by newer methods. 5 papers critique it and 4 beat it on benchmarks, which ranks it #18 of 234 most-superseded. Sub-problem: cluster led by Quest. Newer alternatives in the same sub-problem include ParisKV, KVDrive, Louver, IceCache, and ScoutAttention.


Superseded baseline · #18 of 234 most-superseded

InfLLM

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

KV-cache compression · first seen Feb 7, 2024

superseded — cited as a baseline and beaten by newer methods

5 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites InfLLM as a baseline.

“However, due to its sub-optimal block-level selection, it results in lower performance on most tasks compared to TokenSelect, even though we set a larger token budget for InfLLM.”
— TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
“They require careful hyperparameter tuning (e.g., chunk size in InfLLM [xiao2024infllm], or ANN index construction in RetrievalAttention [liu2024retrievalattention]) and must retain the full KV cache as a candidate pool, limiting memory savings.”
— LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
“this increases latency because of the newly introduced retrieval overhead which was not present in legacy methods”
— More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
“Though the block-level space-continuity assumption improves efficiency, it does not align with real scenarios where relevant tokens are distributed discretely, leading to a significant drop in model quality.”
— PQCache: Product Quantization-based KVCache for Long Context LLM Inference
“Although CPU offloading mitigates GPU memory limitations, existing approaches [xiao2024infllm, zhang2024pqcache] still require retrieving a substantial portion of tokens (around 20%), introducing significant decoding latency overheads due to slow data transfer between CPU RAM and GPU RAM.”
— TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
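
The block-level versus token-level distinction raised in the TokenSelect and PQCache critiques can be illustrated with a toy example. The sketch below is not InfLLM's or TokenSelect's implementation: the scores, block size, and budget are made-up values, the helper names are hypothetical, and scoring a block by the mean of its token scores is an assumption chosen for simplicity.

```python
# Toy comparison of block-level vs token-level KV selection.
# All numbers are illustrative assumptions, not values from any paper.

def select_tokens(scores, budget):
    """Token-level: keep the `budget` highest-scoring token positions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_blocks(scores, block_size, budget):
    """Block-level: rank contiguous blocks by mean score and keep whole
    blocks until the token budget is exhausted."""
    blocks = [
        list(range(start, min(start + block_size, len(scores))))
        for start in range(0, len(scores), block_size)
    ]
    ranked = sorted(
        blocks,
        key=lambda block: sum(scores[i] for i in block) / len(block),
        reverse=True,
    )
    kept = []
    for block in ranked:
        if len(kept) + len(block) > budget:
            break
        kept.extend(block)
    return sorted(kept)


def captured(scores, kept):
    """Total relevance score covered by the kept positions."""
    return sum(scores[i] for i in kept)


# One relevant token per block: relevance is scattered, not contiguous.
scores = [
    0.9, 0.0, 0.0, 0.0,
    0.0, 0.9, 0.0, 0.0,
    0.0, 0.0, 0.9, 0.0,
    0.0, 0.0, 0.0, 0.9,
]
budget = 4

token_kept = select_tokens(scores, budget)
block_kept = select_blocks(scores, block_size=4, budget=budget)

print("token-level keeps", token_kept, "captured", captured(scores, token_kept))
print("block-level keeps", block_kept, "captured", captured(scores, block_kept))
```

With the same budget of four tokens, the token-level selector picks the four scattered relevant positions, while the block-level selector spends its whole budget on one contiguous block that contains only one of them. Real systems score blocks and tokens differently, so this shows only the shape of the argument, not its magnitude.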

Beaten on benchmarks

Head-to-head results where a newer method reports beating InfLLM. Values are copied from the source paper's tables — verify against the cited paper.

TokenSelect beats InfLLM · Avg. [Qwen2-7B]
49.08 vs 32.06
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Llama-3-8B]
43.90 vs 39.23
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Yi-1.5-6B]
36.77 vs 33.76
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Average [Qwen2-7B]
43.64 vs 42.90
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Average [Yi-1.5-6B]
36.02 vs 33.25
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Qwen2-7B (4K+4K)]
75.17 vs 29.82
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Llama-3-8B (4K+4K)]
66.63 vs 42.50
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
TokenSelect beats InfLLM · Avg. [Yi-1.5-6B (2K+512)]
48.93 vs 31.83
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
SAGE-KV beats InfLLM · Average [Llama3.1-8B-Instruct (128k)]
52.49 vs 50.29
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
SAGE-KV beats InfLLM · Average [Llama-3-8B-ProLong-512k-Instruct]
47.64 vs 43.08
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
SAGE-KV beats InfLLM · Average [Qwen2.5-7B-Instruct (128k)]
51.19 vs 47.46
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
Louver (offloaded) beats InfLLM · Avg F1 [KV offloading, 15% budget, LongBench, Llama-3.1-8B-Instruct]
38.9 vs 26.2
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
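
To make the size of each reported gap easier to read, the short script below computes the absolute and relative margins for the head-to-head values listed above. The numbers are copied from this listing and should still be verified against the cited papers; the variable and function names are hypothetical.

```python
# Head-to-head averages copied from the listing above:
# (newer method, setting, newer method's score, InfLLM's score).
RESULTS = [
    ("TokenSelect", "Avg. [Qwen2-7B]", 49.08, 32.06),
    ("TokenSelect", "Avg. [Llama-3-8B]", 43.90, 39.23),
    ("TokenSelect", "Avg. [Yi-1.5-6B]", 36.77, 33.76),
    ("TokenSelect", "Average [Qwen2-7B]", 43.64, 42.90),
    ("TokenSelect", "Average [Yi-1.5-6B]", 36.02, 33.25),
    ("TokenSelect", "Avg. [Qwen2-7B (4K+4K)]", 75.17, 29.82),
    ("TokenSelect", "Avg. [Llama-3-8B (4K+4K)]", 66.63, 42.50),
    ("TokenSelect", "Avg. [Yi-1.5-6B (2K+512)]", 48.93, 31.83),
    ("SAGE-KV", "Average [Llama3.1-8B-Instruct (128k)]", 52.49, 50.29),
    ("SAGE-KV", "Average [Llama-3-8B-ProLong-512k-Instruct]", 47.64, 43.08),
    ("SAGE-KV", "Average [Qwen2.5-7B-Instruct (128k)]", 51.19, 47.46),
    ("Louver (offloaded)",
     "Avg F1 [KV offloading, 15% budget, LongBench, Llama-3.1-8B-Instruct]",
     38.9, 26.2),
]


def margins(results):
    """Return (method, setting, absolute gap, relative gap in %) per row."""
    rows = []
    for method, setting, new, old in results:
        gap = new - old
        rows.append((method, setting, round(gap, 2), round(100 * gap / old, 1)))
    return rows


rows = margins(RESULTS)
smallest = min(rows, key=lambda r: r[2])
largest = max(rows, key=lambda r: r[2])

for method, setting, gap, rel in rows:
    print(f"{method:20s} {setting:70s} +{gap:.2f} ({rel:+.1f}%)")
print("smallest gap:", smallest[0], smallest[1], smallest[2])
print("largest gap:", largest[0], largest[1], largest[2])
```

The rows come from different benchmarks, models, and cache budgets, so the margins are comparable only within a single paper's table, not across rows.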

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.