Ziv Scully

h-index: 12

5 papers

516 citations

Novelty: 49%

AI Score: 47

Ranked #30,766 of 194,257 authors (top 16%); #7,285 in LG (top 18%)

5 Papers

2.0 · PF · May 2

Priority Scheduling in the M/G/1 with Preemption Overhead

Shefali Ramakrishna, Edwin Peng, Ziv Scully

Virtually all practical settings where preemptive scheduling is employed are susceptible to preemption overhead, and accounting for these overheads is necessary to make informed scheduling design decisions. However, preemption overhead is almost never accounted for in queueing-theoretic analyses of preemptive scheduling policies. This is true even for simple preemptive policies in simple queueing models: even the stability region, let alone the response time distribution, is difficult to analyze under overhead. In this work, we give the first response time distribution analysis of an M/G/1 under a preemptive scheduling policy with preemption overhead. Specifically, we consider class-based preemptive priority, where a stochastic overhead is incurred when pausing or resuming a job. We derive a recursive formula for the Laplace transform of response time for jobs of any given class, from which all response time moments can be extracted. Beyond the specific policy and model we analyze, our broader aim is to provide a first step towards a general framework for analyzing queues with preemption overhead. To that end, we perform much of our analysis in a way that applies to a wide variety of overhead models by introducing a new theoretical tool called the job joint transform.

19.0 · LG · Jun 16 · Code

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Yueying Li, Yuanfan Chen, Jiayang Chen et al.

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

11.3 · OC · Jun 12, 2025

The Gittins Index: A Design Principle for Decision-Making Under Uncertainty

Ziv Scully, Alexander Terenin

The Gittins index is a tool that optimally solves a variety of decision-making problems involving uncertainty, including multi-armed bandit problems, minimizing mean latency in queues, and search problems like the Pandora's box model. However, despite the above examples and later extensions thereof, the space of problems that the Gittins index can solve perfectly optimally is limited, and its definition is rather subtle compared to those of other multi-armed bandit algorithms. As a result, the Gittins index is often regarded as being primarily a concept of theoretical importance, rather than a practical tool for solving decision-making problems. The aim of this tutorial is to demonstrate that the Gittins index can be fruitfully applied to practical problems. We start by giving an example-driven introduction to the Gittins index, then walk through several examples of problems it solves, some optimally and some suboptimally but still with excellent performance. Two practical highlights in the latter category are applying the Gittins index to Bayesian optimization, and applying the Gittins index to minimizing tail latency in queues.

9.4 · LG · Jul 16, 2025

Cost-aware Stopping for Bayesian Optimization

Qian Xie, Linda Cai, Alexander Terenin et al.

In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other pairings of acquisition function and stopping rule in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.

17.0 · LG · Jun 28, 2024 · Code

Cost-aware Bayesian Optimization via the Pandora's Box Gittins Index

Qian Xie, Raul Astudillo, Peter I. Frazier et al.

Bayesian optimization is a technique for efficiently optimizing unknown functions in a black-box manner. To handle practical settings where gathering data requires use of finite resources, it is desirable to explicitly incorporate function evaluation costs into Bayesian optimization policies. To understand how to do so, we develop a previously unexplored connection between cost-aware Bayesian optimization and the Pandora's Box problem, a decision problem from economics. The Pandora's Box problem admits a Bayesian-optimal solution based on an expression called the Gittins index, which can be reinterpreted as an acquisition function. We study the use of this acquisition function for cost-aware Bayesian optimization, and demonstrate empirically that it performs well, particularly in medium-high dimensions. We further show that this performance carries over to classical Bayesian optimization without explicit evaluation costs. Our work constitutes a first step towards integrating techniques from Gittins index theory into Bayesian optimization.
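
As a rough illustration of the Gittins index mentioned in the abstract above (not the authors' code), the sketch below computes the classical Pandora's Box reservation value for a single box. It assumes a maximization convention, a Gaussian belief about the box's value with mean `mu` and standard deviation `sigma`, and a fixed opening cost `cost`; the index is the threshold `g` at which the expected improvement over `g` equals the cost. The function names are hypothetical, and the paper's own conventions (for example, minimization, or a posterior from a Gaussian process) may differ.

```python
import math


def expected_improvement(mu, sigma, g):
    """E[max(X - g, 0)] for X ~ Normal(mu, sigma^2)."""
    z = (mu - g) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (pdf + z * cdf)


def gittins_index(mu, sigma, cost, iterations=200):
    """Solve expected_improvement(mu, sigma, g) == cost for g by bisection.

    Illustrative helper (hypothetical name). Expected improvement is
    decreasing in g, so a sign change can be bracketed and bisected.
    """
    if sigma <= 0 or cost <= 0:
        raise ValueError("sigma and cost must be positive")
    # Lower bracket: E[max(X - g, 0)] >= mu - g, so g = mu - cost is low enough.
    lo = mu - cost
    # Upper bracket: step right until expected improvement drops below cost.
    step = sigma
    while expected_improvement(mu, sigma, mu + step) > cost:
        step *= 2.0
    hi = mu + step
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_improvement(mu, sigma, mid) > cost:
            lo = mid  # index lies to the right of mid
        else:
            hi = mid  # index lies at or to the left of mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Same belief, two different opening costs: the cheaper box scores higher.
    print(gittins_index(0.0, 1.0, 0.1))
    print(gittins_index(0.0, 1.0, 0.5))
```

Used as an acquisition function under these assumptions, one would compute this index at each candidate point from its posterior mean, posterior standard deviation, and evaluation cost, then evaluate the point with the largest index. Higher cost lowers the index, and higher uncertainty raises it.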