ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs
This work makes long-context LLM inference more efficient and scalable, which is relevant to developers and researchers who serve or study large language models with very long contexts.
This paper introduces ParisKV, a KV-cache retrieval framework for long-context LLM inference that addresses distribution drift and high latency. ParisKV achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed at batch size 1 for long contexts, delivers up to 2.8x higher throughput within full attention's runnable range, and, at million-token scale, reduces decode latency by 17x and 44x compared to MagicPIG and PQCache, respectively.
KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache top-$k$ retrieval baselines. Code is available at https://github.com/amy-77/ParisKV/tree/main.
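The two-stage idea in the abstract (collision-based candidate selection, then quantized inner-product reranking) can be illustrated with a small NumPy sketch. This is not the ParisKV implementation: the abstract does not specify the hash family, the quantizer, or any parameter values, so the sketch assumes random sign projections for the collision stage and per-key symmetric int8 quantization for the reranking stage. The names `build_index`, `retrieve`, `n_tables`, `n_bits`, and `n_candidates` are hypothetical, and everything runs on the CPU rather than on a GPU with UVA.

```python
import numpy as np


def build_index(keys, n_tables=8, n_bits=8, seed=0):
    """Precompute hash codes and int8-quantized keys (illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    d = keys.shape[1]
    # Assumed hash family: random hyperplanes, one group of n_bits per table.
    planes = rng.standard_normal((n_tables, n_bits, d))
    bits = (np.einsum("tbd,nd->tnb", planes, keys) > 0).astype(np.int64)
    weights = 1 << np.arange(n_bits)
    codes = bits @ weights  # (n_tables, n_keys) bucket id per table
    # Assumed quantizer: symmetric int8 with one scale per key.
    scale = np.abs(keys).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q_keys = np.round(keys / scale).astype(np.int8)
    return planes, codes, q_keys, scale[:, 0]


def retrieve(query, index, n_candidates=64, k=8):
    """Return indices of k keys with the largest estimated inner product."""
    planes, codes, q_keys, scale = index
    # Stage 1: count in how many tables each key shares a bucket with the query.
    q_bits = (planes @ query > 0).astype(np.int64)  # (n_tables, n_bits)
    weights = 1 << np.arange(planes.shape[1])
    q_codes = q_bits @ weights  # (n_tables,)
    collisions = (codes == q_codes[:, None]).sum(axis=0)  # (n_keys,)
    n_candidates = min(n_candidates, collisions.size)
    cand = np.argpartition(-collisions, n_candidates - 1)[:n_candidates]
    # Stage 2: rerank the candidates with a dequantized inner-product estimate.
    est = (q_keys[cand].astype(np.float32) @ query) * scale[cand]
    k = min(k, cand.size)
    return cand[np.argsort(-est)[:k]]
```

A usage note under the same assumptions: calling `index = build_index(keys)` once over a `(n_keys, d)` array and then `retrieve(query, index)` per decoding step touches full-precision data only for the final top-$k$ keys, which is the property that makes on-demand fetching from an offloaded cache attractive. The candidate budget trades recall against reranking cost.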