Quest (KV-cache compression): superseded — cited as a baseline and beaten by newer methods. 13 papers critique it, 16 beat it on benchmarks; #6 of 234 most-superseded. Sub-problem: cluster led by Quest. Newer alternatives in the same sub-problem include ParisKV, KVDrive, Louver, IceCache, ScoutAttention.

Superseded baseline · #6 of 234 most-superseded

Quest

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

KV-cache compression · first seen Jun 16, 2024

superseded — cited as a baseline and beaten by newer methods

13 papers critique it · 16 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Quest as a baseline.

“this line of work does not mitigate the memory footprint, thereby limiting the batch size and preventing accommodation of extremely long contexts (e.g., 1M tokens)”
— ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
“Despite the relatively low overhead, Quest lacks sophisticated design in the retrieval strategy, thus suffers from noticeable performance degradation.”
— A²ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
“This reduces compute and I/O while mostly preserving accuracy, though memory use remains unchanged.”
— FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
“Instead of leveraging the attention patterns of previous tokens, these methods build specialized kernel to approximate attention and identify critical tokens.”
— RefreshKV: Updating Small KV Cache During Long-form Generation
“Quest is sensitive to the page size, and the accuracy significantly drops with large page sizes and small budgets, as shown in [the block-size figure].”
— AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
“Quest stores the entire KV cache in GPU memory with limited capacity, restricting support for long context lengths and large batch sizes.”
— FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
“However, methods like Quest [tang2024quest] and SparQ [ribar2024sparq] encounter memory limitations when attempting to store all tokens on the GPU.”
— TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
“However, it fails to reduce memory usage and suffers from accuracy degradation.”
— HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
“While effective, most methods either discard unused tokens too early or require full cache for scoring.”
— PiKV: KV Cache Management System for Mixture of Experts
“selective loading fails to reduce the memory footprint”
— MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse
“since pages are divided simply by textual positions of tokens, internal fragmentation becomes an issue: a recalled page may contain unimportant tokens, wasting budget that could be allocated to truly important tokens”
— ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
“In fact, to efficiently perform page selection, the method requires storing additional page representations, resulting in a slight memory overhead rather than savings.”
— Inference-Time Hyper-Scaling with KV Cache Compression
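
Several of the critiques above turn on the same mechanism: tokens are grouped into fixed-size pages by position, each page keeps a small summary, and each query loads only the highest-scoring pages. The following is a minimal NumPy sketch of that idea for a single attention head, assuming per-page elementwise min/max key summaries and an upper-bound score. The function names, shapes, and parameters are illustrative assumptions, not Quest's actual code or kernels.

```python
import numpy as np

def build_page_summaries(keys, page_size):
    """Split keys (n_tokens, head_dim) into position-based pages and keep
    the elementwise min and max of each page (assumed summary format)."""
    n_tokens, head_dim = keys.shape
    n_pages = -(-n_tokens // page_size)  # ceiling division
    mins = np.empty((n_pages, head_dim))
    maxs = np.empty((n_pages, head_dim))
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        mins[p] = page.min(axis=0)
        maxs[p] = page.max(axis=0)
    return mins, maxs

def page_upper_bounds(query, mins, maxs):
    """Upper bound on q.k for any key in each page: per channel, the
    largest product q_i * k_i is reached at either the min or the max."""
    return np.maximum(query * mins, query * maxs).sum(axis=1)

def select_pages(query, mins, maxs, page_budget):
    """Indices of the page_budget pages with the highest upper bound."""
    scores = page_upper_bounds(query, mins, maxs)
    k = min(page_budget, len(scores))
    return np.sort(np.argsort(scores)[::-1][:k])

def sparse_attention(query, keys, values, page_size, page_budget):
    """Softmax attention restricted to the tokens of the selected pages.
    The full keys/values stay resident; only the tokens read shrink."""
    mins, maxs = build_page_summaries(keys, page_size)
    pages = select_pages(query, mins, maxs, page_budget)
    idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, len(keys)))
        for p in pages
    ])
    logits = keys[idx] @ query / np.sqrt(keys.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx], idx
```

In this sketch the summaries are stored in addition to the full cache (two vectors per page, or 1/page_size of the key-plus-value storage when pages are full), which is consistent with the critiques that memory use is unchanged or slightly higher. Because a page is selected as a whole, a single high-scoring token brings its page's other tokens along, which is the fragmentation and page-size sensitivity the quotes describe.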

Beaten on benchmarks

Head-to-head results where a newer method reports beating Quest. Values are copied from the source paper's tables — verify against the cited paper.

ShadowKV beats Quest · Avg (RULER) [Llama-3-8B-1M]
86.88 vs 82.03
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
ShadowKV beats Quest · Avg (LongBench) [Llama-3-8B-1M]
39.94 vs 36.65
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
ShadowKV beats Quest · Avg (RULER) [GLM-4-9B-1M]
85.62 vs 77.86
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
ShadowKV beats Quest · Avg (LongBench) [GLM-4-9B-1M]
47.89 vs 41.52
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
ShadowKV beats Quest · Avg (RULER) [Llama-3.1-8B]
83.57 vs 76.29
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
ShadowKV beats Quest · Avg (LongBench) [Llama-3.1-8B]
48.13 vs 44.80
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
A²ATS beats Quest · Accuracy [Llama-3.1-8B-Instruct, Sparsity ~0.060]
86.6 vs 80.7
A²ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
A²ATS beats Quest · Accuracy [MegaBeam-Mistral-7B-512K, Sparsity ~0.062]
86.3 vs 78.4
A²ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
LouisKV beats Quest · AIME [Qwen3-8B]
0.66 vs 0.60
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
RocketKV beats Quest · LB Avg. [Token Budget 256]
51.1 vs 17.8
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats Quest · NIAH [Token Budget 256]
100.0 vs 10.7
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
RocketKV beats Quest · Avg. [Token Budget 1024]
44.3 vs 15.7
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
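
To compare the gaps above at a glance, the reported pairs can be turned into absolute and relative margins. This is a small sketch: the values are copied from the entries above and should still be verified against the cited papers, and the `margins` helper is a hypothetical name introduced here. The scales differ between benchmarks (AIME is reported as a fraction, the others appear to be on a 0 to 100 scale), so margins are comparable only within one benchmark and setting.

```python
# Each row: (newer method, metric, setting, newer score, Quest score),
# copied from the entries above; verify against the cited papers.
RESULTS = [
    ("ShadowKV", "Avg (RULER)", "Llama-3-8B-1M", 86.88, 82.03),
    ("ShadowKV", "Avg (LongBench)", "Llama-3-8B-1M", 39.94, 36.65),
    ("ShadowKV", "Avg (RULER)", "GLM-4-9B-1M", 85.62, 77.86),
    ("ShadowKV", "Avg (LongBench)", "GLM-4-9B-1M", 47.89, 41.52),
    ("ShadowKV", "Avg (RULER)", "Llama-3.1-8B", 83.57, 76.29),
    ("ShadowKV", "Avg (LongBench)", "Llama-3.1-8B", 48.13, 44.80),
    ("A2ATS", "Accuracy", "Llama-3.1-8B-Instruct, Sparsity ~0.060", 86.6, 80.7),
    ("A2ATS", "Accuracy", "MegaBeam-Mistral-7B-512K, Sparsity ~0.062", 86.3, 78.4),
    ("LouisKV", "AIME", "Qwen3-8B", 0.66, 0.60),
    ("RocketKV", "LB Avg.", "Token Budget 256", 51.1, 17.8),
    ("RocketKV", "NIAH", "Token Budget 256", 100.0, 10.7),
    ("RocketKV", "Avg.", "Token Budget 1024", 44.3, 15.7),
]

def margins(rows):
    """Return each row with its absolute margin over Quest and the margin
    as a percentage of Quest's score."""
    out = []
    for method, metric, setting, new, quest in rows:
        absolute = round(new - quest, 2)
        relative = round((new - quest) / quest * 100, 1)
        out.append((method, metric, setting, absolute, relative))
    return out

for method, metric, setting, absolute, relative in margins(RESULTS):
    print(f"{method:9s} {metric:16s} [{setting}] +{absolute} ({relative:+.1f}%)")
```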

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include ParisKV, KVDrive, Louver, IceCache, and ScoutAttention.