Regulating Branch Parallelism in LLM Serving

Swapnil Gandhi, Siva Hari, William J. Dally, Christos Kozyrakis

arXiv:2605.06914


AI Analysis

For LLM serving systems, TAPER addresses the branch externality problem to improve throughput and SLO attainment under intra-request parallelism.

TAPER is a per-step admission controller that regulates branch parallelism in LLM serving. It admits extra branches only when the predicted branch externality fits within the batch's slack budget. On Qwen3-32B, this improves goodput by 1.77× over IRP-Off and 1.48× over IRP-Eager while maintaining over 95% SLO attainment.

Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by $1.77\times$ over IRP-Off and by $1.48\times$ over IRP-Eager, while maintaining over $95\%$ SLO attainment.
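The admission rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy linear latency model (each extra branch adds a fixed cost to the shared decode step) and takes the batch's slack budget to be the smallest slack among co-batched requests. The names `Request`, `predicted_externality_ms`, `admit_branches`, and `per_branch_cost_ms` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    slack_ms: float        # assumed: accumulated slack against the request's SLO
    pending_branches: int  # extra independent branches ready to decode


def predicted_externality_ms(extra_branches: int, per_branch_cost_ms: float) -> float:
    """Toy predictor (an assumption): step latency grows linearly with extra branches."""
    return extra_branches * per_branch_cost_ms


def admit_branches(requests: List[Request], per_branch_cost_ms: float) -> List[int]:
    """Return how many extra branches to admit per request for this decode step.

    Branches are handed out round-robin and admitted only while the predicted
    externality of the whole step still fits within the batch's slack budget.
    """
    if not requests:
        return []
    # Assumed budget: the tightest request in the batch bounds the whole step.
    budget_ms = max(0.0, min(r.slack_ms for r in requests))
    admitted = [0] * len(requests)
    total = 0
    progress = True
    while progress:
        progress = False
        for i, req in enumerate(requests):
            if admitted[i] >= req.pending_branches:
                continue  # this request has no more branches to offer
            if predicted_externality_ms(total + 1, per_branch_cost_ms) <= budget_ms:
                admitted[i] += 1
                total += 1
                progress = True
    return admitted
```

Because the decision is recomputed at every decode step, the admitted width follows the batch's slack as it changes: a batch with one request near its SLO admits few or no extra branches, while a batch with ample slack admits more.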
