Is FlexGen superseded?

FlexGen (KV-cache compression): superseded. It is cited as a baseline and beaten by newer methods. 7 papers critique it and 2 beat it on benchmarks, ranking it #19 of 234 most-superseded. Sub-problem: cluster led by Quest. Newer alternatives in the same sub-problem include ParisKV, KVDrive, Louver, IceCache, and ScoutAttention.

Superseded baseline · #19 of 234 most-superseded

FlexGen

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

KV-cache compression · first seen Mar 13, 2023

superseded — cited as a baseline and beaten by newer methods

7 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites FlexGen as a baseline.

“FlexGen ... and PipeSwitch ... attempt to overlap GPU computation of the current layer with KV cache loading for the next layer. However, the effectiveness of such an overlap is capped by the task that takes the longest time. In most systems, PCIe transfer time overshadows GPU computation latency, particularly with large batch and context sizes. Hence, fully overlapping GPU computation with PCIe transfer time is infeasible.”
— KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
“Although this mitigates GPU memory pressure, it significantly degrades inference performance due to data transfer latency and complex scheduling overhead.”
— KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
“its execution model necessitates loading the entire KV cache from off-chip storage during every generation step. This heavily I/O-bound approach incurs severe latency penalties, causing the throughput to plummet to less than 1 token/s.”
— KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
“[sheng2023flexgen, zhao2023atom] quantized KV cache activations to 4-bits, but required fine-grained grouping for 4-bit quantization, while still observing some perplexity degradation”
— KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
“FlexGen explores offloading strategies between GPU, CPU, and disk storage, but suffers from the high latency of PCIe transfers (typically 8-12GB/s) compared to GPU HBM bandwidth (>2TB/s).”
— CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
“FlexGen [sheng2023flexgen] demonstrated CPU+disk offloading with static policies”
— Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
“It does not consider SLO constraints or reconfigure at runtime.”
— OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
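
The overlap argument in the KVPR quote and the bandwidth figures in the CXL-SpecKV quote can be made concrete with a little arithmetic. The sketch below is only an illustration, not code from FlexGen or any cited paper: the per-layer KV cache size and compute time are hypothetical values, the PCIe figure is the midpoint of the quoted 8-12GB/s range, and the HBM figure is the quoted 2TB/s lower bound.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to move num_bytes over a link of the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


def step_seconds(compute_s: float, transfer_s: float, overlap: bool) -> float:
    """Per-layer time: max() if transfer fully overlaps compute, sum otherwise."""
    return max(compute_s, transfer_s) if overlap else compute_s + transfer_s


# Hypothetical workload numbers, chosen only for illustration.
KV_BYTES_PER_LAYER = 1e9     # assumed KV cache bytes fetched per layer
COMPUTE_S_PER_LAYER = 0.02   # assumed GPU compute time per layer

PCIE_GB_PER_S = 10.0         # midpoint of the quoted 8-12 GB/s range
HBM_GB_PER_S = 2000.0        # the quoted ">2TB/s" lower bound

pcie_s = transfer_seconds(KV_BYTES_PER_LAYER, PCIE_GB_PER_S)
hbm_s = transfer_seconds(KV_BYTES_PER_LAYER, HBM_GB_PER_S)

serial = step_seconds(COMPUTE_S_PER_LAYER, pcie_s, overlap=False)
overlapped = step_seconds(COMPUTE_S_PER_LAYER, pcie_s, overlap=True)

print(f"PCIe transfer: {pcie_s:.4f} s, HBM read: {hbm_s:.4f} s")
print(f"serial: {serial:.4f} s, overlapped: {overlapped:.4f} s")
```

Under these assumed numbers the PCIe transfer is the longer task, so overlapping hides only the compute time and the transfer remains the floor on per-layer latency, which is the limitation the quote describes.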

Beaten on benchmarks

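Head-to-head results where a newer method reports beating FlexGen. In each pair, the newer method's value comes first and FlexGen's second; higher is better for throughput and lower is better for perplexity. Values are copied from the source paper's tables; verify against the cited paper.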

KVPR beats FlexGen · Throughput [OPT-6.7B]
Source: KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
seq_len 256, gen_len 32: 53.976 vs 50.057
seq_len 256, gen_len 128: 49.860 vs 46.779
seq_len 512, gen_len 32: 33.666 vs 29.614
seq_len 512, gen_len 128: 32.277 vs 28.650
seq_len 1024, gen_len 32: 18.285 vs 15.778
seq_len 1024, gen_len 128: 18.108 vs 16.194
KVQuant beats FlexGen · Perplexity (PPL)
Source: KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
LLaMA-7B 4-bit: 5.69 vs 5.73
LLaMA-7B 3-bit: 5.75 vs 5.93
LLaMA-7B 2-bit: 6.01 vs 11.09
LLaMA-13B 4-bit: 5.10 vs 5.14
LLaMA-13B 3-bit: 5.14 vs 5.29
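
To compare the margins across rows, the pairs listed above can be turned into relative throughput gains and absolute perplexity reductions. This is a small sketch using only the numbers on this page; the dictionary layout and the helper name `relative_gain_percent` are illustrative choices, and the throughput unit is not stated in the source, so only ratios are computed.

```python
# (seq_len, gen_len) -> (KVPR, FlexGen) throughput on OPT-6.7B, as listed above.
THROUGHPUT = {
    (256, 32): (53.976, 50.057),
    (256, 128): (49.860, 46.779),
    (512, 32): (33.666, 29.614),
    (512, 128): (32.277, 28.650),
    (1024, 32): (18.285, 15.778),
    (1024, 128): (18.108, 16.194),
}

# (model, bits) -> (KVQuant, FlexGen) perplexity, as listed above.
PERPLEXITY = {
    ("LLaMA-7B", 4): (5.69, 5.73),
    ("LLaMA-7B", 3): (5.75, 5.93),
    ("LLaMA-7B", 2): (6.01, 11.09),
    ("LLaMA-13B", 4): (5.10, 5.14),
    ("LLaMA-13B", 3): (5.14, 5.29),
}


def relative_gain_percent(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


# Higher throughput is better, so report the relative gain over FlexGen.
throughput_gain = {k: relative_gain_percent(n, b) for k, (n, b) in THROUGHPUT.items()}

# Lower perplexity is better, so report how far below FlexGen the newer value is.
ppl_drop = {k: b - n for k, (n, b) in PERPLEXITY.items()}

for (seq_len, gen_len), gain in throughput_gain.items():
    print(f"seq_len {seq_len}, gen_len {gen_len}: +{gain:.1f}% throughput")
for (model, bits), drop in ppl_drop.items():
    print(f"{model} {bits}-bit: perplexity lower by {drop:.2f}")
```

On these figures the reported throughput gains fall between roughly 7% and 16%, and the perplexity gap is small at 4-bit and 3-bit but large at 2-bit.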

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.