DC AI AR · Aug 2, 2025

PiKV: KV Cache Management System for Mixture of Experts

Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang

arXiv:2508.06526v1 · 3 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical performance issue for developers and researchers using MoE-based models in multi-GPU and multi-node inference settings, representing an incremental improvement by optimizing existing cache management methods specifically for MoE architectures.

The paper tackles the memory and communication bottleneck caused by dense KV cache storage in Mixture of Experts (MoE) architectures during large language model inference. It introduces PiKV, a parallel and distributed KV cache management system that reduces this overhead through expert-sharded storage, routing, scheduling, and compression. The authors report reductions in memory usage and latency in their experiments.

As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce \textbf{PiKV}, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages \textit{expert-sharded KV storage} to partition caches across GPUs, \textit{PiKV routing} to reduce token-to-KV access, and \textit{PiKV Scheduling} to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates \textit{PiKV Compression} modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library: \href{https://github.com/NoakLiu/PiKV}{https://github.com/NoakLiu/PiKV}. Experimental details are recorded at: \href{https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md}{https://github.com/NoakLiu/PiKV/Experimental\_Results}. PiKV is also integrated with NVIDIA kvpress for acceleration; for details, see \href{https://github.com/NoakLiu/PiKVpress}{https://github.com/NoakLiu/PiKVpress}. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
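The three ideas named in the abstract (expert-sharded storage, routing, and score-based retention) can be illustrated with a small toy sketch. This is not PiKV's API or implementation: the class `ShardedKVCache`, the function `route_top_k`, the per-shard capacity, and the "evict the lowest-scoring entry" policy are all hypothetical stand-ins chosen for illustration, and plain Python lists stand in for GPU tensors.

```python
class ShardedKVCache:
    """Toy expert-sharded KV cache (hypothetical; not the PiKV library API).

    Each expert owns one bounded shard instead of all experts sharing
    a single dense, globally synchronized cache.
    """

    def __init__(self, num_experts, capacity_per_shard):
        self.capacity = capacity_per_shard
        # One dict per expert: token_id -> (key, value, relevance_score)
        self.shards = [dict() for _ in range(num_experts)]

    def put(self, expert_id, token_id, key, value, score):
        """Store an entry in one expert's shard, evicting if over capacity."""
        shard = self.shards[expert_id]
        shard[token_id] = (key, value, score)
        if len(shard) > self.capacity:
            # Assumed retention policy: drop the lowest-scoring entry.
            victim = min(shard, key=lambda t: shard[t][2])
            del shard[victim]

    def get(self, expert_id):
        """Return the retained token_id -> (key, value) pairs for one expert."""
        return {t: (k, v) for t, (k, v, _) in self.shards[expert_id].items()}

    def total_entries(self):
        return sum(len(shard) for shard in self.shards)


def route_top_k(gate_scores, k):
    """Return the indices of the k experts with the highest gate scores."""
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    cache = ShardedKVCache(num_experts=4, capacity_per_shard=2)
    # Each token is written only to the shards of its top-1 routed expert.
    tokens = [
        (0, [0.9, 0.05, 0.03, 0.02], 0.5),
        (1, [0.8, 0.10, 0.05, 0.05], 0.1),
        (2, [0.7, 0.10, 0.10, 0.10], 0.9),
        (3, [0.1, 0.20, 0.60, 0.10], 0.4),
    ]
    for token_id, gate_scores, score in tokens:
        for expert_id in route_top_k(gate_scores, k=1):
            cache.put(expert_id, token_id, [float(token_id)], [float(token_id)], score)
    print(sorted(cache.get(0)), cache.total_entries())
```

In this sketch a token touches only the shards of the experts it is routed to, so total cache size is bounded by the number of experts times the per-shard capacity rather than growing with every token on every device.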
