Method Drift

Living systematic review

KV-cache compression

Methods that cut the memory and bandwidth cost of the transformer key-value cache in long-context LLM inference: token eviction, quantization and low-rank compression, offload and reuse, and head- or layer-adaptive budgeting.

264 papers · 613 critique receipts · 2,449 benchmark results · updated Jun 18, 2026

Most-superseded baselines

Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.

  1. SnapKV · SnapKV
    SnapKV: LLM Knows What You are Looking for Before Generation

    51 papers critique it · 71 beat it on benchmarks

  2. StreamingLLM · SnapKV
    Efficient Streaming Language Models with Attention Sinks

    43 papers critique it · 44 beat it on benchmarks

  3. KIVI · KIVI
    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    20 papers critique it · 27 beat it on benchmarks

  4. Quest · Quest
    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    13 papers critique it · 16 beat it on benchmarks

  5. TOVA · SnapKV
    Transformers are Multi-State RNNs

    6 papers critique it · 14 beat it on benchmarks

  6. CaM · SnapKV

    6 papers critique it · 6 beat it on benchmarks

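The ranking rule described above (count the distinct papers that critique or beat each method) can be sketched roughly as follows. This is an illustration, not the review's actual pipeline: the `(paper_id, method, kind)` receipt shape, the function name `rank_superseded`, and the alphabetical tie-break are all assumptions, and the sample rows are made-up placeholders rather than entries from the knowledge base.

```python
from collections import defaultdict


def rank_superseded(receipts):
    """Rank methods by the number of distinct papers that critique or beat them.

    receipts: iterable of (paper_id, method, kind), with kind in
    {"critique", "beat"}. This record shape is an assumption.
    Returns a list of (method, n_critique_papers, n_beat_papers).
    """
    any_kind = defaultdict(set)  # method -> distinct papers of either kind
    by_kind = defaultdict(lambda: {"critique": set(), "beat": set()})
    for paper_id, method, kind in receipts:
        any_kind[method].add(paper_id)
        by_kind[method][kind].add(paper_id)
    # Most-superseded first; ties broken alphabetically (an assumption).
    ranked = sorted(any_kind, key=lambda m: (-len(any_kind[m]), m))
    return [
        (m, len(by_kind[m]["critique"]), len(by_kind[m]["beat"]))
        for m in ranked
    ]


# Placeholder data: "paper-1" etc. and "method-A" etc. are hypothetical labels.
toy_receipts = [
    ("paper-1", "method-A", "critique"),
    ("paper-1", "method-A", "beat"),
    ("paper-2", "method-A", "beat"),
    ("paper-3", "method-B", "critique"),
]
print(rank_superseded(toy_receipts))
```

A paper that both critiques and beats a method is counted once toward the ranking key but appears in both per-kind tallies, which is one plausible reading of "distinct papers".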

Sub-problems

Methods that compete on the same benchmarks cluster into distinct sub-problems.

SnapKV · 133 methods

SnapKV · H2O · StreamingLLM · PyramidKV · TOVA · AdaKV

KIVI · 66 methods

KIVI · KVQuant · TurboQuant · RTN · GEAR · QuaRot

Quest · 57 methods

Quest · ShadowKV · InfiniGen · DuoAttention · InfLLM · FlexGen

Palu · 30 methods

Palu · ThinK · Eigen Attention · PagedAttention · Loki · Lexico

MiniCache · 23 methods

MiniCache · CacheBlend · TurboRAG · EPIC · Mooncake · PromptCache

ReKV · 27 methods

ReKV · FastV · InfiniPot-V · SparseVLM · InfiniPot · LiveVLM
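One simple way to read "methods that compete on the same benchmarks cluster into sub-problems" is as connected components over shared benchmarks. The sketch below takes that reading; it is an assumption, not the review's documented procedure. The function name `cluster_by_benchmark` and the `(method, benchmark)` pair format are hypothetical, and the sample rows are placeholders.

```python
def cluster_by_benchmark(results):
    """Group methods that are linked, directly or transitively, by a shared benchmark.

    results: iterable of (method, benchmark) pairs (an assumed record shape).
    Returns clusters as sorted lists, largest cluster first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_on = {}  # benchmark -> first method seen on it
    for method, bench in results:
        find(method)  # register the method even if it shares no benchmark
        if bench in first_on:
            union(first_on[bench], method)
        else:
            first_on[bench] = method

    clusters = {}
    for m in list(parent):
        clusters.setdefault(find(m), set()).add(m)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Placeholder data: method and benchmark labels are hypothetical.
toy_results = [
    ("method-A", "bench-1"), ("method-B", "bench-1"),
    ("method-C", "bench-2"), ("method-D", "bench-2"),
    ("method-E", "bench-3"),
]
print(cluster_by_benchmark(toy_results))
```

With real data a single widely used benchmark would merge nearly everything into one component, so a production version would more likely weight edges by the number of shared benchmarks and apply a community-detection step; that refinement is left out here.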

The frontier

Recent methods not yet superseded in the knowledge base.