OS · Apr 9

Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

Fangyue Liu, Hua Liu, Xinyuan Lyu, Shuo Ai, Hao Liang, Lingpeng Chen, Ziqian Hu, Chong Zha, Xin Jin, Hanmei Luo, Peng Chen

arXiv:2604.07874 · 53.9

Predicted impact: top 25% in OS · last 90 days
Originality: Incremental advance

AI Analysis

This addresses resource inefficiency in production LLM services, offering a practical solution for cloud providers or large-scale AI deployments, though it is incremental as it builds on existing colocation concepts.

The paper tackles the problem of low GPU utilization due to bursty LLM inference traffic by proposing Valve, a production-friendly online-offline colocation system that improves cluster utilization by 34.6%, saving 2,170 GPUs, with minimal online interference (<5% TTFT and <2% TPOT increase).

LLM inference now powers latency-critical production services. Because inference traffic is bursty, operators over-provision, which in turn leads to resource underutilization. While online-offline colocation promises to put idle capacity to use, broad production deployment must overcome two major challenges: (i) large online interference caused by slow or frequent preemptions, and (ii) extensive framework and driver modifications needed to colocate different models and support preemption. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and a 20-line framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a saving of 2,170 GPUs. This efficiency gain is achieved with minimal online interference, incurring a <5% TTFT increase and a <2% TPOT increase across workloads.
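To make the two bounds in the abstract concrete, here is a minimal Python sketch of what "at most once per online request" and "rate-limited reclamation" could look like as admission checks. It is a toy model only, not Valve's GPU runtime: the class `PreemptionGovernor`, its methods, and the parameters `reclaim_rate` and `burst` are hypothetical names, and the token bucket is a standard rate-limiting technique swapped in for illustration, since the abstract does not say how Valve enforces its rate limit.

```python
class PreemptionGovernor:
    """Toy model of the two bounds; hypothetical, not Valve's actual runtime."""

    def __init__(self, reclaim_rate: float, burst: float = 1.0):
        self.reclaim_rate = reclaim_rate  # assumed: reclamations allowed per second
        self.burst = burst                # assumed: token bucket capacity
        self._tokens = burst              # start with a full bucket
        self._last = 0.0                  # timestamp of the last bucket update
        self._preempted = set()           # online requests that already triggered a preemption

    def may_preempt_compute(self, request_id: str) -> bool:
        """Allow compute preemption at most once per online request."""
        if request_id in self._preempted:
            return False
        self._preempted.add(request_id)
        return True

    def finish_request(self, request_id: str) -> None:
        """Forget a completed online request."""
        self._preempted.discard(request_id)

    def may_reclaim_memory(self, now: float) -> bool:
        """Token bucket: refill by elapsed time, spend one token per reclamation."""
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.burst, self._tokens + elapsed * self.reclaim_rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    gov = PreemptionGovernor(reclaim_rate=1.0, burst=1.0)
    print(gov.may_preempt_compute("req-1"), gov.may_preempt_compute("req-1"))
    print([gov.may_reclaim_memory(t) for t in (0.0, 0.5, 2.0)])
```

In this sketch the first check bounds how often a single online request can interrupt offline work, and the second bounds how often memory is reclaimed over time. Timestamps are passed in explicitly so the behavior is deterministic and easy to test.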
