OS · Apr 9

Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

arXiv:2604.07874 · 53.9
Predicted impact top 25% in OS · last 90 days
Originality Incremental advance
AI Analysis

The work addresses resource inefficiency in production LLM services and offers a practical solution for cloud providers and large-scale AI deployments, though it is incremental in that it builds on existing colocation concepts.

The paper tackles low GPU utilization caused by bursty LLM inference traffic. It proposes Valve, a production-friendly online-offline colocation system that improves cluster utilization by 34.6%, saving 2,170 GPUs, with minimal online interference: under 5% increase in time to first token (TTFT) and under 2% increase in time per output token (TPOT).

LLM inference now powers latency-critical production services. The bursty nature of inference traffic results in over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) the extensive framework and driver modifications needed to colocate different models and support preemption. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and a 20-line framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a saving of 2,170 GPUs. This efficiency gain is achieved with minimal online interference, incurring a <5% TTFT increase and a <2% TPOT increase across workloads.
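To make the idea of jointly bounding preemption count and rate concrete, here is a minimal toy sketch in Python. It is not Valve's GPU runtime or its API: the class `PreemptionPolicy`, its methods, and the parameters `reclaim_rate` and `reclaim_burst` are all hypothetical names introduced for illustration. It assumes one simple way such bounds could be enforced: a per-request flag for the "at most once per online request" rule, and a token bucket for rate-limiting memory reclamation.

```python
class PreemptionPolicy:
    """Toy model of two bounds (hypothetical, not Valve's implementation):
    1. at most one compute preemption per online request;
    2. memory reclamations rate-limited by a token bucket."""

    def __init__(self, reclaim_rate: float, reclaim_burst: int):
        self.reclaim_rate = reclaim_rate      # assumed: tokens refilled per second
        self.reclaim_burst = reclaim_burst    # assumed: bucket capacity
        self._tokens = float(reclaim_burst)   # start with a full bucket
        self._last = 0.0                      # timestamp of the last refill
        self._preempted = set()               # online requests already charged

    def try_compute_preempt(self, request_id: str) -> bool:
        """Allow a compute preemption only once for each online request."""
        if request_id in self._preempted:
            return False
        self._preempted.add(request_id)
        return True

    def finish_request(self, request_id: str) -> None:
        """Forget a completed request so the set does not grow without bound."""
        self._preempted.discard(request_id)

    def try_reclaim(self, now: float) -> bool:
        """Allow a memory reclamation only if a token is available."""
        elapsed = max(0.0, now - self._last)
        self._tokens = min(float(self.reclaim_burst),
                           self._tokens + elapsed * self.reclaim_rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


policy = PreemptionPolicy(reclaim_rate=1.0, reclaim_burst=1)
first = policy.try_compute_preempt("req-1")    # allowed
second = policy.try_compute_preempt("req-1")   # refused: already preempted once
reclaims = [policy.try_reclaim(t) for t in (0.0, 0.25, 1.0)]
```

In this sketch the timestamps are passed in explicitly so the behaviour is deterministic; a real system would read a monotonic clock and would enforce these limits inside the GPU runtime rather than in application code.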
