Quest
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
KV-cache compression · first seen Jun 16, 2024
superseded — cited as a baseline and beaten by newer methods
13 papers critique it · 16 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Quest as a baseline.
“this line of work does not mitigate the memory footprint, thereby limiting the batch size and preventing accommodation of extremely long contexts (e.g., 1M tokens)”
— ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
“Despite the relatively low overhead, Quest lacks sophisticated design in the retrieval strategy, thus suffers from noticeable performance degradation.”
— A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“This reduces compute and I/O while mostly preserving accuracy, though memory use remains unchanged.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“Instead of leveraging the attention patterns of previous tokens, these methods build specialized kernel to approximate attention and identify critical tokens.”
— RefreshKV: Updating Small KV Cache During Long-form Generation
“Quest is sensitive to the page size, and the accuracy significantly drops with large page sizes and small budgets, as shown in fig:block_size.”
— AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
“Quest stores the entire KV cache in GPU memory with limited capacity, restricting support for long context lengths and large batch sizes.”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
“However, methods like Quest~tang2024quest and SparQ~ribar2024sparq encounter memory limitations when attempting to store all tokens on the GPU.”
— TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
“However, it fails to reduce memory usage and suffers from accuracy degradation.”
— HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
“While effective, most methods either discard unused tokens too early or require full cache for scoring.”
— PiKV: KV Cache Management System for Mixture of Experts
“selective loading fails to reduce the memory footprint”
— MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse
“since pages are divided simply by textual positions of tokens, internal fragmentation becomes an issue: a recalled page may contain unimportant tokens, wasting budget that could be allocated to truly important tokens”
— ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
“In fact, to efficiently perform page selection, the method requires storing additional page representations, resulting in a slight memory overhead rather than savings.”
— Inference-Time Hyper-Scaling with KV Cache Compression
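Several of the critiques above (page-size sensitivity, internal fragmentation, extra page representations, unchanged memory footprint) are easier to follow with the page-selection step in view. The following is a minimal pure-Python sketch, assuming the commonly described Quest scheme: keys are grouped into fixed-size pages by position, each page keeps an elementwise min and max of its keys, and a query scores a page by an upper bound on its dot product with any key in that page. The function names (`page_metadata`, `page_upper_bound`, `select_pages`) and the toy data are hypothetical and are not taken from the Quest codebase.

```python
from typing import List, Tuple

Vector = List[float]


def page_metadata(keys: List[Vector], page_size: int) -> List[Tuple[Vector, Vector]]:
    """Split keys into position-based pages; keep elementwise (min, max) per page.

    Assumption: this metadata is stored in addition to the full KV cache,
    which is why the critiques say memory use is not reduced.
    """
    meta = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(col) for col in zip(*page)]
        maxs = [max(col) for col in zip(*page)]
        meta.append((mins, maxs))
    return meta


def page_upper_bound(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on the dot product of query with any key in the page."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query: Vector, meta: List[Tuple[Vector, Vector]], budget_pages: int) -> List[int]:
    """Return the indices of the highest-scoring pages, in positional order."""
    scores = [page_upper_bound(query, lo, hi) for lo, hi in meta]
    ranked = sorted(range(len(meta)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget_pages])


# Toy example: six 2-dimensional keys, two keys per page.
keys = [
    [1.0, 0.0], [0.9, 0.1],    # page 0
    [0.0, 1.0], [0.1, 0.9],    # page 1
    [-1.0, 0.0], [-0.8, -0.2], # page 2
]
meta = page_metadata(keys, page_size=2)
selected = select_pages([1.0, 0.0], meta, budget_pages=1)
```

In this sketch a selected page is always recalled whole, so a page holding one relevant key and several irrelevant ones still consumes its full share of the budget, which is the fragmentation concern raised in the ClusterKV quote. Larger pages also make the min/max bound looser, which is consistent with the page-size sensitivity noted in the AttentionPredictor quote.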
Beaten on benchmarks
Head-to-head results where a newer method reports beating Quest. Values are copied from the source paper's tables — verify against the cited paper.
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (RULER) [Llama-3-8B-1M]
86.88 vs 82.03
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (LongBench) [Llama-3-8B-1M]
39.94 vs 36.65
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (RULER) [GLM-4-9B-1M]
85.62 vs 77.86
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (LongBench) [GLM-4-9B-1M]
47.89 vs 41.52
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (RULER) [Llama-3.1-8B]
83.57 vs 76.29
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Sys beats Quest · Avg (LongBench) [Llama-3.1-8B]
48.13 vs 44.80
- A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
A$^2$ATS beats Quest · Accuracy [Llama-3.1-8B-Instruct, Sparsity ~0.060]
86.6 vs 80.7
- A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
A$^2$ATS beats Quest · Accuracy [MegaBeam-Mistral-7B-512K, Sparsity ~0.062]
86.3 vs 78.4
- LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
LouisKV beats Quest · AIME [Qwen3-8B]
0.66 vs 0.60
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats Quest · LB Avg. [Token Budget 256]
51.1 vs 17.8
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats Quest · NIAH [Token Budget 256]
100.0 vs 10.7
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats Quest · Avg. [Token Budget 1024]
44.3 vs 15.7
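Because the entries above mix score scales (percent-style averages alongside a 0 to 1 AIME score), relative margins can be easier to compare than raw differences. The sketch below computes both for a few of the listed pairs; the `margins` helper and the row labels are illustrative assumptions, and the values are copied from the list above rather than re-verified against the papers.

```python
from typing import Tuple


def margins(new_score: float, quest_score: float) -> Tuple[float, float]:
    """Return (absolute gap, gap relative to Quest's score)."""
    absolute = new_score - quest_score
    relative = absolute / quest_score
    return absolute, relative


# (source paper, setting, newer method's score, Quest's score)
rows = [
    ("ShadowKV", "Avg (RULER), Llama-3-8B-1M", 86.88, 82.03),
    ("A$^2$ATS", "Accuracy, Llama-3.1-8B-Instruct", 86.6, 80.7),
    ("LouisKV", "AIME, Qwen3-8B", 0.66, 0.60),
    ("RocketKV", "LB Avg., Token Budget 256", 51.1, 17.8),
]

for paper, setting, new_score, quest_score in rows:
    absolute, relative = margins(new_score, quest_score)
    print(f"{paper:10s} {setting:34s} +{absolute:.2f} ({relative:.1%} over Quest)")
```

The relative figure only normalizes for scale; it does not account for differences in token budget, page size, or model across papers, so the rows remain unsuitable for ranking the newer methods against each other.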
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 28, 2026
- May 18, 2026
- Louver · Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · May 7, 2026
- Apr 12, 2026
- ScoutAttention · ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference · Mar 28, 2026
- DynSplit-KV · DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference · Feb 3, 2026
- HeteroCache · HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference · Jan 20, 2026
- Dec 11, 2025
- CLO · CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design · Nov 18, 2025
- Oct 13, 2025