Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

arXiv:2604.26968


AI Analysis

This work tackles memory management bottlenecks for large-scale GPU inference serving, a critical problem for cost and throughput in production AI systems.

KV cache memory management is the primary bottleneck in large-scale GPU inference. The proposed system targets three inefficiencies with three components: unified KV cache sizing across attention architectures (removing memory over-provisioning of up to 57x), a six-tier memory hierarchy that extends capacity from 40 GB to over 38 TB per node, and a Bayesian reuse predictor that achieves 70-84% cache hit rates. Projected gains are a 1.4-2.1x TTFT reduction, a 1.7-2.9x throughput improvement, and a 47% cost reduction.
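To make the sizing point concrete, the following is a minimal sketch of per-token KV cache sizing by attention variant. It is not the paper's sizing engine: the `AttnConfig` class, its field names, and the example dimensions (chosen to resemble a DeepSeek-V2-style MLA model) are assumptions for illustration. It uses the standard accounting in which MHA, GQA, and MQA cache one key and one value vector per KV head, while MLA caches a single compressed latent plus a decoupled RoPE key per layer.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical model description; field names are illustrative."""
    kind: str               # "mha", "gqa", "mqa", or "mla"
    n_layers: int
    n_heads: int
    head_dim: int
    n_kv_heads: int = 0     # used by GQA only
    kv_lora_rank: int = 0   # MLA compressed-latent width
    rope_head_dim: int = 0  # MLA decoupled RoPE key width
    dtype_bytes: int = 2    # fp16 / bf16


def kv_bytes_per_token(cfg: AttnConfig) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    if cfg.kind == "mha":
        per_layer = 2 * cfg.n_heads * cfg.head_dim      # K and V for every head
    elif cfg.kind == "gqa":
        per_layer = 2 * cfg.n_kv_heads * cfg.head_dim   # K and V per KV group
    elif cfg.kind == "mqa":
        per_layer = 2 * cfg.head_dim                    # one shared K/V head
    elif cfg.kind == "mla":
        per_layer = cfg.kv_lora_rank + cfg.rope_head_dim  # latent + RoPE key
    else:
        raise ValueError(f"unknown attention kind: {cfg.kind}")
    return cfg.n_layers * per_layer * cfg.dtype_bytes


# Assumed example dimensions, not taken from the source.
mla = AttnConfig("mla", n_layers=60, n_heads=128, head_dim=128,
                 kv_lora_rank=512, rope_head_dim=64)
naive_mha = replace(mla, kind="mha")  # what an MLA-unaware framework would reserve

ratio = kv_bytes_per_token(naive_mha) / kv_bytes_per_token(mla)
print(f"over-provisioning factor: {ratio:.1f}x")  # prints 56.9x
```

Under these assumed dimensions, sizing an MLA model as if it were plain MHA over-reserves by roughly 57x, the same order as the figure quoted above. The source does not say how its number was derived, so the match should be read as suggestive only.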

Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache sizing across attention architectures, in particular multi-head latent attention (MLA), which general-purpose frameworks do not support, resulting in up to 57x memory over-provisioning; (2) confinement of the KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per attention type, enabling up to 7.4x higher batch sizes. A six-tier memory hierarchy extends effective KV cache capacity from 40 GB to over 38 TB per node while maintaining sub-millisecond time-to-first-token (TTFT) for hot entries. A Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs is combined with EMA-scored head-granular eviction and RoPE-aware prefetching. In component-level validation on trace replay using ShareGPT, LMSYS-Chat-1M, and agentic workloads, the predictor achieves 70-84% cache hit rates. Analytical projections that combine validated component behavior with published hardware specifications indicate a 1.4-2.1x TTFT reduction, a 1.7-2.9x throughput improvement, and a 47% cost reduction compared to state-of-the-art baselines.
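A Beta conjugate prior over a reuse/no-reuse outcome can be sketched as follows. This is a generic Beta-Bernoulli update, not the paper's predictor: the abstract gives only the count of 16 (block-type, transition-type) pairs, so the 4 x 4 split, the category names, the uniform Beta(1, 1) prior, and the `BetaReusePredictor` class are all assumptions for illustration.

```python
from itertools import product

# Hypothetical categories; the source specifies only that there are 16 pairs.
BLOCK_TYPES = ("system_prompt", "user_turn", "tool_output", "generated")
TRANSITION_TYPES = ("continue", "branch", "retry", "end")


class BetaReusePredictor:
    """One Beta(alpha, beta) posterior per (block-type, transition-type) pair."""

    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        # Assumed uniform prior; each pair starts with the same pseudo-counts.
        self.params = {
            pair: [prior_alpha, prior_beta]
            for pair in product(BLOCK_TYPES, TRANSITION_TYPES)
        }

    def update(self, block_type: str, transition_type: str, reused: bool) -> None:
        """Conjugate update: a reuse adds 1 to alpha, a non-reuse adds 1 to beta."""
        ab = self.params[(block_type, transition_type)]
        if reused:
            ab[0] += 1.0
        else:
            ab[1] += 1.0

    def reuse_probability(self, block_type: str, transition_type: str) -> float:
        """Posterior mean alpha / (alpha + beta) of the reuse probability."""
        alpha, beta = self.params[(block_type, transition_type)]
        return alpha / (alpha + beta)


predictor = BetaReusePredictor()
for outcome in (True, True, True, False):  # three reuses, one miss
    predictor.update("system_prompt", "continue", outcome)

# Posterior is Beta(4, 2), so the mean is 4 / 6.
print(round(predictor.reuse_probability("system_prompt", "continue"), 3))  # 0.667
```

An eviction policy could then prefer to keep blocks whose pair has a high posterior mean. How the paper combines this estimate with its EMA scores and prefetching is not described in the abstract.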
