FlexGen
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
KV-cache compression · first seen Mar 13, 2023
superseded — cited as a baseline and beaten by newer methods
7 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites FlexGen as a baseline.
“FlexGen ... and PipeSwitch ... attempt to overlap GPU computation of the current layer with KV cache loading for the next layer. However, the effectiveness of such an overlap is capped by the task that takes the longest time. In most systems, PCIe transfer time overshadows GPU computation latency, particularly with large batch and context sizes. Hence, fully overlapping GPU computation with PCIe transfer time is infeasible.”
— KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
“Although this mitigates GPU memory pressure, it significantly degrades inference performance due to data transfer latency and complex scheduling overhead.”
— KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
“its execution model necessitates loading the entire KV cache from off-chip storage during every generation step. This heavily I/O-bound approach incurs severe latency penalties, causing the throughput to plummet to less than 1 token/s.”
— KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
“[sheng2023flexgen, zhao2023atom] quantized KV cache activations to 4-bits, but required fine-grained grouping for 4-bit quantization, while still observing some perplexity degradation”
— KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
“FlexGen explores offloading strategies between GPU, CPU, and disk storage, but suffers from the high latency of PCIe transfers (typically 8-12GB/s) compared to GPU HBM bandwidth (>2TB/s).”
— CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
“FlexGen [sheng2023flexgen] demonstrated CPU+disk offloading with static policies”
— Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
“It does not consider SLO constraints or reconfigure at runtime.”
— OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
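The overlap argument in the KVPR critique can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not FlexGen's or KVPR's actual scheduler: it assumes a single pipelined step in which compute and KV-cache loading run concurrently, so the step costs the larger of the two times. The function names and the example figures (1 GB of KV cache, 0.02 s of compute) are hypothetical; the 10 GB/s bandwidth is picked from the 8-12 GB/s PCIe range quoted by CXL-SpecKV.

```python
"""Toy cost model for overlapping GPU compute with KV-cache loading.

All names and numbers here are illustrative assumptions, not measurements.
"""


def transfer_seconds(kv_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move kv_bytes across a link with the given bandwidth."""
    return kv_bytes / bandwidth_bytes_per_s


def step_seconds(compute_s: float, transfer_s: float, overlap: bool) -> float:
    """Cost of one step: max() when overlapped, sum when run back to back."""
    if overlap:
        return max(compute_s, transfer_s)
    return compute_s + transfer_s


def exposed_transfer_seconds(compute_s: float, transfer_s: float) -> float:
    """Transfer time that overlap cannot hide behind compute."""
    return max(0.0, transfer_s - compute_s)


if __name__ == "__main__":
    kv_bytes = 1e9        # hypothetical KV-cache slice to load for one step
    pcie_bw = 10e9        # assumed 10 GB/s, inside the quoted 8-12 GB/s range
    compute_s = 0.02      # hypothetical GPU compute time for the same step

    load_s = transfer_seconds(kv_bytes, pcie_bw)
    print(f"serial step:     {step_seconds(compute_s, load_s, False):.3f} s")
    print(f"overlapped step: {step_seconds(compute_s, load_s, True):.3f} s")
    print(f"exposed I/O:     {exposed_transfer_seconds(compute_s, load_s):.3f} s")
```

Under these assumed numbers the transfer dominates, so overlap can save at most the compute time and most of the load remains exposed. That is the sense in which the quote calls the overlap "capped by the task that takes the longest time".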
Beaten on benchmarks
Head-to-head results where a newer method reports beating FlexGen. Values are copied from the source paper's tables — verify against the cited paper.
- KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
KVPR beats FlexGen · Throughput [OPT-6.7B] · values listed as KVPR vs FlexGen
seq_len 256, gen_len 32: 53.976 vs 50.057
seq_len 256, gen_len 128: 49.860 vs 46.779
seq_len 512, gen_len 32: 33.666 vs 29.614
seq_len 512, gen_len 128: 32.277 vs 28.650
seq_len 1024, gen_len 32: 18.285 vs 15.778
seq_len 1024, gen_len 128: 18.108 vs 16.194
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
KVQuant beats FlexGen · Perplexity (PPL, lower is better) · values listed as KVQuant vs FlexGen
LLaMA-7B 4-bit: 5.69 vs 5.73
LLaMA-7B 3-bit: 5.75 vs 5.93
LLaMA-7B 2-bit: 6.01 vs 11.09
LLaMA-13B 4-bit: 5.10 vs 5.14
LLaMA-13B 3-bit: 5.14 vs 5.29
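The head-to-head values listed above can be turned into relative margins, which makes the size of each win easier to compare across settings. The sketch below copies the numbers from this page and assumes the first value in each pair belongs to the newer method, which matches the direction of every "beats" claim (higher throughput, lower perplexity). The helper names are hypothetical, and the results inherit whatever errors the copied values contain.

```python
"""Relative margins for the reported KVPR and KVQuant wins over FlexGen."""

# (seq_len, gen_len, KVPR throughput, FlexGen throughput), OPT-6.7B
KVPR_THROUGHPUT = [
    (256, 32, 53.976, 50.057),
    (256, 128, 49.860, 46.779),
    (512, 32, 33.666, 29.614),
    (512, 128, 32.277, 28.650),
    (1024, 32, 18.285, 15.778),
    (1024, 128, 18.108, 16.194),
]

# (setting, KVQuant PPL, FlexGen PPL)
KVQUANT_PPL = [
    ("LLaMA-7B 4-bit", 5.69, 5.73),
    ("LLaMA-7B 3-bit", 5.75, 5.93),
    ("LLaMA-7B 2-bit", 6.01, 11.09),
    ("LLaMA-13B 4-bit", 5.10, 5.14),
    ("LLaMA-13B 3-bit", 5.14, 5.29),
]


def relative_gain(new: float, baseline: float) -> float:
    """Fractional increase over the baseline (higher-is-better metrics)."""
    return (new - baseline) / baseline


def relative_reduction(new: float, baseline: float) -> float:
    """Fractional decrease from the baseline (lower-is-better metrics)."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    for seq_len, gen_len, kvpr, flexgen in KVPR_THROUGHPUT:
        gain = relative_gain(kvpr, flexgen)
        print(f"seq_len {seq_len:>4}, gen_len {gen_len:>3}: {gain:+.1%} throughput")
    for setting, kvquant, flexgen in KVQUANT_PPL:
        drop = relative_reduction(kvquant, flexgen)
        print(f"{setting:<16}: {drop:.1%} lower perplexity")
```

On the values listed here, the throughput gains fall between about 6.6% and 15.9%, and the perplexity gap is small at 4-bit and 3-bit but large at 2-bit.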
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 28, 2026
- May 18, 2026
- Louver · Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · May 7, 2026
- Apr 12, 2026
- ScoutAttention · ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference · Mar 28, 2026
- DynSplit-KV · DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference · Feb 3, 2026
- HeteroCache · HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference · Jan 20, 2026
- Dec 11, 2025
- CLO · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design · Nov 18, 2025
- Oct 13, 2025