Loki
Loki: Low-rank Keys for Efficient Sparse Attention
KV-cache compression · first seen Jun 4, 2024
Status: superseded (cited as a baseline and beaten by newer methods)
2 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Loki as a baseline.
“Approaches like SparQ Attention ribar2024sparqattentionbandwidthefficientllm, AQUA Attention s2025aquaattentionquerymagnitudes, and Loki singhania2024lokilowrankkeysefficient prioritize computational savings over memory reduction.”
Source: SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
“Another representative work is Loki singhania2024loki, which performs the PCA transformation of KV cache and then selects % tokens based on attention scores computed in low-dimensional space for sparse attention. However, this method still suffers from the increasing KV cache size.”
Source: SALS: Sparse Attention in Latent Space for KV cache Compression
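The SALS quote summarizes how Loki selects tokens. First, keys are rotated into a PCA basis. Next, every cached token is scored using only the leading principal dimensions. The top-scoring tokens are kept, and attention runs over them alone. Below is a minimal NumPy sketch of that description under stated assumptions. It is not Loki's implementation. The helper names `pca_basis` and `loki_style_attention` are hypothetical, as are the single-head, single-query setup, the parameters `r` and `k`, and building the PCA basis from the current keys rather than from any offline calibration.

```python
import numpy as np


def pca_basis(keys: np.ndarray) -> np.ndarray:
    """Hypothetical helper: columns are the principal directions of `keys` (n x d),
    ordered by decreasing variance. The result is an orthonormal d x d matrix."""
    centered = keys - keys.mean(axis=0, keepdims=True)
    cov = centered.T @ centered                    # (d, d) scatter matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues come back ascending
    return eigvecs[:, np.argsort(eigvals)[::-1]]   # reorder columns: largest variance first


def loki_style_attention(q, keys_rot, values, basis, r, k):
    """Hypothetical single-head, single-query sketch.

    q: (d,) query; keys_rot: (n, d) keys already rotated into `basis`;
    values: (n, d_v); r: principal dims used for scoring; k: tokens kept.
    """
    d = q.shape[0]
    q_rot = q @ basis                              # rotate the query into the same basis
    approx = keys_rot[:, :r] @ q_rot[:r]           # cheap low-dimensional scores for all n tokens
    k = min(k, keys_rot.shape[0])
    top = np.argpartition(approx, -k)[-k:]         # indices of the k largest approximate scores
    # The rotation is orthogonal, so full-dimension dot products in the rotated
    # space equal the original q . key products: these are exact scores.
    scores = keys_rot[top] @ q_rot / np.sqrt(d)
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[top], top


rng = np.random.default_rng(0)
n, d, true_rank = 1024, 64, 8
# Synthetic keys with low-rank structure plus a little noise (an assumption that
# mirrors the "low-rank keys" premise in Loki's title).
keys = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, d))
keys += 0.01 * rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
q = rng.standard_normal(d)

basis = pca_basis(keys)
keys_rot = keys @ basis   # the rotated cache still stores all n keys
out, selected = loki_style_attention(q, keys_rot, values, basis, r=16, k=64)
print(out.shape, selected.shape)  # (64,) (64,)
```

Selection cuts the per-step attention compute from n tokens to k, but `keys_rot` and `values` still hold every cached token. That retained footprint is the memory concern both critiques raise.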
Beaten on benchmarks
Head-to-head results where a newer method reports beating Loki. Values are copied from the source paper's tables; verify them against the cited paper.
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
  - ShadowKV beats Loki · Avg (RULER) [Llama-3-8B-1M]: 86.88 vs 9.33
  - ShadowKV beats Loki · Avg (LongBench) [Llama-3-8B-1M]: 39.94 vs 15.78
  - ShadowKV beats Loki · Avg (RULER) [GLM-4-9B-1M]: 85.62 vs 28.57
  - ShadowKV beats Loki · Avg (LongBench) [GLM-4-9B-1M]: 47.89 vs 32.35
  - ShadowKV beats Loki · Avg (RULER) [Llama-3.1-8B]: 83.57 vs 35.52
  - ShadowKV beats Loki · Avg (LongBench) [Llama-3.1-8B]: 48.13 vs 27.02
- SALS: Sparse Attention in Latent Space for KV cache Compression
  - SALS beats Loki · Avg [LongBench tasks, token sparse methods]: 32.26 vs 31.95
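The margins in these results vary widely. A small script makes the absolute and relative gaps explicit. The scores are copied from the list above. The `Result` record and the `margin` helper are hypothetical conveniences, and the SALS entry names no model, so it is marked "unspecified".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    source: str      # paper reporting the comparison
    benchmark: str
    model: str
    new: float       # score of the newer method
    loki: float      # Loki's score in the same table


RESULTS = [
    Result("ShadowKV", "RULER", "Llama-3-8B-1M", 86.88, 9.33),
    Result("ShadowKV", "LongBench", "Llama-3-8B-1M", 39.94, 15.78),
    Result("ShadowKV", "RULER", "GLM-4-9B-1M", 85.62, 28.57),
    Result("ShadowKV", "LongBench", "GLM-4-9B-1M", 47.89, 32.35),
    Result("ShadowKV", "RULER", "Llama-3.1-8B", 83.57, 35.52),
    Result("ShadowKV", "LongBench", "Llama-3.1-8B", 48.13, 27.02),
    Result("SALS", "LongBench (token sparse)", "unspecified", 32.26, 31.95),
]


def margin(r: Result) -> tuple[float, float]:
    """Absolute gap in points, and that gap relative to Loki's score."""
    diff = r.new - r.loki
    return diff, diff / r.loki


for r in RESULTS:
    diff, rel = margin(r)
    print(f"{r.source:8} {r.benchmark:25} {r.model:14} {diff:+.2f} pts ({rel:+.1%})")
```

The SALS gap of 0.31 points is far smaller than the smallest ShadowKV gap of 15.54 points (LongBench, GLM-4-9B-1M). The two papers also use different settings, so margins are only comparable within a single source.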
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 21, 2026
- May 8, 2026
- Mar 24, 2026
- Mar 17, 2026
- Mar 15, 2026
- Feb 5, 2026
- Jan 29, 2026
- GPU-accelerated INT8 quantization for KV cache compression · GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models · Jan 8, 2026
- STA-Attention · Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders · Dec 11, 2025
- SWAN · SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression · Nov 24, 2025
- Oct 28, 2025
- Sep 25, 2025