LGOct 13, 2023
Optimal Sample Complexity for Average Reward Markov Decision ProcessesShengbo Wang, Jose Blanchet, Peter Glynn · stanford
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 ε^{-2})$ and a lower bound of $Ω(|S||A|t_{\text{mix}} ε^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $ε$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}ε^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
LGFeb 15, 2023
Optimal Sample Complexity of Reinforcement Learning for Mixing Discounted Markov Decision ProcessesShengbo Wang, Jose Blanchet, Peter Glynn · stanford
We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, leading to a sample complexity dependence on $γ$ and $ε$ of the form $\tilde Θ((1-γ)^{-3}ε^{-2})$, where $γ$ denotes the discount factor and $ε$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) induces mixing. We establish that in such settings, the optimal sample complexity dependence is $\tilde Θ(t_{\text{mix}}(1-γ)^{-2}ε^{-2})$, where $t_{\text{mix}}$ is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.
77.2MLMay 25
Statistical Inference for Stochastic Gradient Descent Beyond Finite VarianceJose Blanchet, Peter Glynn, Wenhao Yang
Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.
AIMar 14, 2022
The Design and Implementation of a Broadly Applicable Algorithm for Optimizing Intra-Day Surgical SchedulingJin Xie, Teng Zhang, Jose Blanchet et al.
Surgical scheduling optimization is an active area of research. However, few algorithms to optimize surgical scheduling are implemented and see sustained use. An algorithm is more likely to be implemented, if it allows for surgeon autonomy, i.e., requires only limited scheduling centralization, and functions in the limited technical infrastructure of widely used electronic medical records (EMRs). In order for an algorithm to see sustained use, it must be compatible with changes to hospital capacity, patient volumes, and scheduling practices. To meet these objectives, we developed the BEDS (better elective day of surgery) algorithm, a greedy heuristic for smoothing unit-specific surgical admissions across days. We implemented BEDS in the EMR of a large pediatric academic medical center. The use of BEDS was associated with a reduction in the variability in the number of admissions. BEDS is freely available as a dashboard in Tableau, a commercial software used by numerous hospitals. BEDS is readily implementable with the limited tools available to most hospitals, does not require reductions to surgeon autonomy or centralized scheduling, and is compatible with changes to hospital capacity or patient volumes. We present a general algorithmic framework from which BEDS is derived based on a particular choice of objectives and constraints. We argue that algorithms generated by this framework retain many of the desirable characteristics of BEDS while being compatible with a wide range of objectives and constraints.
41.5DSMar 15
A Single-Sample Polylogarithmic Regret Bound for Nonstationary Online Linear ProgrammingHaoran Xu, Owen Shen, Peter Glynn et al.
We study nonstationary Online Linear Programming (OLP), where $n$ orders arrive sequentially with reward-resource consumption pairs that form a sequence of independent, but not necessarily identically distributed, random vectors. At the beginning of the planning horizon, the decision-maker is provided with a resource endowment that is sufficient to fulfill a significant portion of the requests. The decision-maker seeks to maximize the expected total reward by making immediate and irrevocable acceptance or rejection decisions for each order, subject to this resource endowment. We focus on the challenging single-sample setting, where only one sample from each of the $n$ distributions is available at the start of the planning horizon. We propose a novel re-solving algorithm that integrates a dynamic programming perspective with the dual-based frameworks traditionally employed in stationary environments. In the large-resource regime, where the resource endowment scales linearly with the number of orders, we prove that our algorithm achieves $O((\log n)^2)$ regret across a broad class of nonstationary distribution sequences. Our results demonstrate that polylogarithmic regret is attainable even under significant environmental shifts and minimal data availability, bridging the gap between stationary OLP and more volatile real-world resource allocation problems.
LGAug 22, 2025
Deep Learning for Markov Chains: Lyapunov Functions, Poisson's Equation, and Stationary DistributionsYanlin Qu, Jose Blanchet, Peter Glynn
Lyapunov functions are fundamental to establishing the stability of Markovian models, yet their construction typically demands substantial creativity and analytical effort. In this paper, we show that deep learning can automate this process by training neural networks to satisfy integral equations derived from first-transition analysis. Beyond stability analysis, our approach can be adapted to solve Poisson's equation and estimate stationary distributions. While neural networks are inherently function approximators on compact domains, it turns out that our approach remains effective when applied to Markov chains on non-compact state spaces. We demonstrate the effectiveness of this methodology through several examples from queueing theory and beyond.
LGFeb 13, 2022
Surgical Scheduling via Optimization and Machine Learning with Long-Tailed DataYuan Shi, Saied Mahdian, Jose Blanchet et al.
Using data from cardiovascular surgery patients with long and highly variable post-surgical lengths of stay (LOS), we develop a modeling framework to reduce recovery unit congestion. We estimate the LOS and its probability distribution using machine learning models, schedule procedures on a rolling basis using a variety of optimization models, and estimate performance with simulation. The machine learning models achieved only modest LOS prediction accuracy, despite access to a very rich set of patient characteristics. Compared to the current paper-based system used in the hospital, most optimization models failed to reduce congestion without increasing wait times for surgery. A conservative stochastic optimization with sufficient sampling to capture the long tail of the LOS distribution outperformed the current manual process and other stochastic and robust optimization approaches. These results highlight the perils of using oversimplified distributional models of LOS for scheduling procedures and the importance of using optimization methods well-suited to dealing with long-tailed behavior.
LGAug 24, 2019
Optimal $δ$-Correct Best-Arm Selection for Heavy-Tailed DistributionsShubhada Agrawal, Sandeep Juneja, Peter Glynn
Given a finite set of unknown distributions or arms that can be sampled, we consider the problem of identifying the one with the maximum mean using a $δ$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $δ$) that has minimum sample complexity. Lower bounds for $δ$-correct algorithms are well known. $δ$-correct algorithms that match the lower bound asymptotically as $δ$ reduces to zero have been previously developed when arm distributions are restricted to a single parameter exponential family. In this paper, we first observe a negative result that some restrictions are essential, as otherwise, under a $δ$-correct algorithm, distributions with unbounded support would require an infinite number of samples in expectation. We then propose a $δ$-correct algorithm that matches the lower bound as $δ$ reduces to zero under the mild restriction that a known bound on the expectation of $(1+ε)^{th}$ moment of the underlying random variables exists, for $ε> 0$. We also propose batch processing and identify near-optimal batch sizes to speed up the proposed algorithm substantially. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classic problem in the simulation community.
MLJun 7, 2019
Optimal Transport Relaxations with Application to Wasserstein GANsSaied Mahdian, Jose Blanchet, Peter Glynn
We propose a family of relaxations of the optimal transport problem which regularize the problem by introducing an additional minimization step over a small region around one of the underlying transporting measures. The type of regularization that we obtain is related to smoothing techniques studied in the optimization literature. When using our approach to estimate optimal transport costs based on empirical measures, we obtain statistical learning bounds which are useful to guide the amount of regularization, while maintaining good generalization properties. To illustrate the computational advantages of our regularization approach, we apply our method to training Wasserstein GANs. We obtain running time improvements, relative to current benchmarks, with no deterioration in testing performance (via FID). The running time improvement occurs because our new optimality-based threshold criterion reduces the number of expensive iterates of the generating networks, while increasing the number of actor-critic iterations.
LGJan 30, 2019
Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement LearningCasey Chu, Jose Blanchet, Peter Glynn
This paper provides a unifying view of a wide range of problems of interest in machine learning by framing them as the minimization of functionals defined on the space of probability measures. In particular, we show that generative adversarial networks, variational inference, and actor-critic methods in reinforcement learning can all be seen through the lens of our framework. We then discuss a generic optimization algorithm for our formulation, called probability functional descent (PFD), and show how this algorithm recovers existing methods developed independently in the settings mentioned earlier.
ROMay 5, 2018
An Accelerated Approach to Safely and Efficiently Test Pre-Production Autonomous Vehicles on Public StreetsMansur Arief, Peter Glynn, Ding Zhao
Various automobile and mobility companies, for instance Ford, Uber and Waymo, are currently testing their pre-produced autonomous vehicle (AV) fleets on the public roads. However, due to rareness of the safety-critical cases and, effectively, unlimited number of possible traffic scenarios, these on-road testing efforts have been acknowledged as tedious, costly, and risky. In this study, we propose Accelerated De- ployment framework to safely and efficiently estimate the AVs performance on public streets. We showed that by appropriately addressing the gradual accuracy improvement and adaptively selecting meaningful and safe environment under which the AV is deployed, the proposed framework yield to highly accurate estimation with much faster evaluation time, and more importantly, lower deployment risk. Our findings provide an answer to the currently heated and active discussions on how to properly test AV performance on public roads so as to achieve safe, efficient, and statistically-reliable testing framework for AV technologies.
PRApr 4, 2018
Probabilistic Contraction Analysis of Iterated Random OperatorsAbhishek Gupta, Rahul Jain, Peter Glynn
In many branches of engineering, Banach contraction mapping theorem is employed to establish the convergence of certain deterministic algorithms. Randomized versions of these algorithms have been developed that have proved useful in data-driven problems. In a class of randomized algorithms, in each iteration, the contraction map is approximated with an operator that uses independent and identically distributed samples of certain random variables. This leads to iterated random operators acting on an initial point in a complete metric space, and it generates a Markov chain. In this paper, we develop a new stochastic dominance based proof technique, called probabilistic contraction analysis, for establishing the convergence in probability of Markov chains generated by such iterated random operators in certain limiting regime. The methods developed in this paper provides a general framework for understanding convergence of a wide variety of Monte Carlo methods in which contractive property is present. We apply the convergence result to conclude the convergence of fitted value iteration and fitted relative value iteration in continuous state and continuous action Markov decision problems as representative applications of the general framework developed here.
OCJun 18, 2017
On the convergence of mirror descent beyond stochastic convex programmingZhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos et al.
In this paper, we examine the convergence of mirror descent in a class of stochastic optimization problems that are not necessarily convex (or even quasi-convex), and which we call variationally coherent. Since the standard technique of "ergodic averaging" offers no tangible benefits beyond convex programming, we focus directly on the algorithm's last generated sample (its "last iterate"), and we show that it converges with probabiility $1$ if the underlying problem is coherent. We further consider a localized version of variational coherence which ensures local convergence of stochastic mirror descent (SMD) with high probability. These results contribute to the landscape of non-convex stochastic optimization by showing that (quasi-)convexity is not essential for convergence to a global minimum: rather, variational coherence, a much weaker requirement, suffices. Finally, building on the above, we reveal an interesting insight regarding the convergence speed of SMD: in problems with sharp minima (such as generic linear programs or concave minimization problems), SMD reaches a minimum point in a finite number of steps (a.s.), even in the presence of persistent gradient noise. This result is to be contrasted with existing black-box convergence rate estimates that are only asymptotic.
MLOct 11, 2016
Statistics of Robust Optimization: A Generalized Empirical Likelihood ApproachJohn Duchi, Peter Glynn, Hongseok Namkoong
We study statistical inference and distributionally robust solution methods for stochastic optimization problems, focusing on confidence intervals for optimal values and solutions that achieve exact coverage asymptotically. We develop a generalized empirical likelihood framework---based on distributional uncertainty sets constructed from nonparametric $f$-divergence balls---for Hadamard differentiable functionals, and in particular, stochastic optimization problems. As consequences of this theory, we provide a principled method for choosing the size of distributional uncertainty regions to provide one- and two-sided confidence intervals that achieve exact coverage. We also give an asymptotic expansion for our distributionally robust formulation, showing how robustification regularizes problems by their variance. Finally, we show that optimizers of the distributionally robust formulations we study enjoy (essentially) the same consistency properties as those in classical sample average approximations. Our general approach applies to quickly mixing stationary sequences, including geometrically ergodic Harris recurrent Markov chains.
PRJul 16, 2015
Selecting the best system and multi-armed banditsPeter Glynn, Sandeep Juneja
Consider the problem of finding a population or a probability distribution amongst many with the largest mean when these means are unknown but population samples can be simulated or otherwise generated. Typically, by selecting largest sample mean population, it can be shown that false selection probability decays at an exponential rate. Lately, researchers have sought algorithms that guarantee that this probability is restricted to a small $δ$ in order $\log(1/δ)$ computational time by estimating the associated large deviations rate function via simulation. We show that such guarantees are misleading when populations have unbounded support even when these may be light-tailed. Specifically, we show that any policy that identifies the correct population with probability at least $1-δ$ for each problem instance requires infinite number of samples in expectation in making such a determination in any problem instance. This suggests that some restrictions are essential on populations to devise $O(\log(1/δ))$ algorithms with $1 - δ$ correctness guarantees. We note that under restriction on population moments, such methods are easily designed, and that sequential methods from stochastic multi-armed bandit literature can be adapted to devise such algorithms.