inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
The work addresses the challenge that cloud providers and other organizations face in planning GPU fleet capacity for LLM inference. The contribution is incremental, since it builds on existing queueing and simulation methods.
The paper tackles the problem of sizing GPU fleets for LLM inference. It presents inference-fleet-sim, a tool that combines analytical queueing theory with discrete-event simulation to find minimum-cost configurations that meet a 99th-percentile time-to-first-token (P99 TTFT) service-level objective (SLO), and it evaluates the tool on seven scenarios built from public and synthetic traces.
Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.
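To make the analytical half of this approach concrete, the following is a minimal sketch of M/G/c sizing. It uses the Allen-Cunneen approximation, which is one common choice and is an assumption here; the abstract does not say which approximation the tool uses. The function names (`erlang_c`, `mean_wait_mgc`, `min_servers`) and all parameters are hypothetical, and the sketch targets a mean queueing delay rather than a P99 TTFT, which is a deliberate simplification.

```python
import math


def erlang_c(c: int, a: float) -> float:
    """Probability that an arrival waits in an M/M/c queue.

    a is the offered load (arrival rate times mean service time).
    """
    if a >= c:
        return 1.0  # unstable queue: every arrival waits
    # Erlang B via the standard recursion, which avoids large factorials.
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    rho = a / c
    return b / (1.0 - rho * (1.0 - b))


def mean_wait_mgc(lam: float, mean_s: float, scv: float, c: int) -> float:
    """Allen-Cunneen approximation of the mean queueing delay in M/G/c.

    lam    : arrival rate (requests per second)
    mean_s : mean service time (seconds)
    scv    : squared coefficient of variation of the service time
    c      : number of servers (here, a stand-in for GPUs)
    """
    a = lam * mean_s
    if a >= c:
        return math.inf
    wq_mmc = erlang_c(c, a) * mean_s / (c - a)  # exact M/M/c mean wait
    return wq_mmc * (1.0 + scv) / 2.0           # variability correction


def min_servers(lam: float, mean_s: float, scv: float,
                target_wait: float, c_max: int = 100_000) -> int:
    """Smallest c whose approximate mean wait is at most target_wait."""
    c = max(1, math.floor(lam * mean_s) + 1)  # smallest stable server count
    while c <= c_max:
        if mean_wait_mgc(lam, mean_s, scv, c) <= target_wait:
            return c
        c += 1
    raise ValueError("no feasible server count up to c_max")


if __name__ == "__main__":
    # Illustrative numbers only: same load, light-tailed vs heavy-tailed service.
    for scv in (1.0, 16.0):
        print(scv, min_servers(lam=10.0, mean_s=1.0, scv=scv, target_wait=0.1))
```

A sketch like this only captures the first two moments of the service-time distribution and ignores routing, which is consistent with the abstract's point that the analytical estimate needs to be checked against discrete-event simulation over the full token-length distribution.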