Peter W. Glynn


12 Papers

OC · Oct 17, 2022
Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion

Li Xia, Peter W. Glynn

CVaR (Conditional Value at Risk) is a risk metric widely used in finance. However, dynamically optimizing CVaR is difficult since it is not a standard Markov decision process (MDP) and the principle of dynamic programming fails. In this paper, we study the infinite-horizon discrete-time MDP with a long-run CVaR criterion, from the viewpoint of sensitivity-based optimization. By introducing a pseudo CVaR metric, we derive a CVaR difference formula which quantifies the difference of long-run CVaR under any two policies. The optimality of deterministic policies is derived. We obtain a so-called Bellman local optimality equation for CVaR, which is a necessary and sufficient condition for locally optimal policies and only necessary for globally optimal policies. A CVaR derivative formula is also derived for providing more sensitivity information. Then we develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space. We further discuss some extensions including the mean-CVaR optimization and the maximization of CVaR. Finally, we conduct numerical experiments related to portfolio management to demonstrate the main results. Our work may shed light on dynamically optimizing CVaR from a sensitivity viewpoint.
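
As a rough illustration of the risk metric itself (not of the paper's long-run criterion or its policy iteration algorithm), the following sketch computes the empirical VaR and CVaR of a finite sample of costs. It assumes the common Rockafellar-Uryasev form, CVaR = VaR + E[(X - VaR)^+] / (1 - alpha); the function name `var_cvar` and the sample data are hypothetical.

```python
import math


def var_cvar(costs, alpha):
    """Empirical VaR and CVaR of a cost sample at confidence level alpha.

    Assumption (not from the abstract): costs are equally weighted and
    CVaR is taken in the Rockafellar-Uryasev form
        CVaR = VaR + E[(X - VaR)^+] / (1 - alpha).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xs = sorted(costs)
    n = len(xs)
    # VaR: smallest sample value whose empirical CDF is at least alpha.
    k = max(0, math.ceil(alpha * n) - 1)
    var = xs[k]
    # Average excess of the costs over VaR, taken over the whole sample.
    mean_excess = sum(max(x - var, 0.0) for x in xs) / n
    return var, var + mean_excess / (1.0 - alpha)


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(var_cvar(sample, 0.8))
```

For this sample with alpha = 0.8, VaR is 8 and CVaR is approximately 9.5, the mean of the two largest costs.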

LG · Oct 11, 2022
The Typical Behavior of Bandit Algorithms

Lin Fan, Peter W. Glynn

We establish strong laws of large numbers and central limit theorems for the regret of two of the most popular bandit algorithms: Thompson sampling and UCB. Here, our characterizations of the regret distribution complement the characterizations of the tail of the regret distribution recently developed by Fan and Glynn (2021) (arXiv:2109.13595). The tail characterizations there are associated with atypical bandit behavior on trajectories where the optimal arm mean is under-estimated, leading to mis-identification of the optimal arm and large regret. In contrast, our SLLNs and CLTs here describe the typical behavior and fluctuation of regret on trajectories where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the CLT matching the asymptotics of expected regret. Both the mean and variance in the CLT grow at $\log(T)$ rates with the time horizon $T$. Asymptotically as $T \to \infty$, the variability in the number of plays of each sub-optimal arm depends only on the rewards received for that arm, which indicates that each sub-optimal arm contributes independently to the overall CLT variance.
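
For readers who want to observe regret on a single trajectory, here is a minimal simulation sketch. It assumes the textbook UCB1 index (sample mean plus $\sqrt{2 \log t / n}$) and Bernoulli rewards, which may differ from the UCB variant analyzed in the paper; the function name `ucb1` and the arm means are hypothetical. It reports pseudo-regret, i.e. the sum of mean gaps of the arms played.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms; return (play counts, pseudo-regret).

    Assumption: the index is mean + sqrt(2 * ln(t) / n), the classical UCB1
    choice, used here only for illustration.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of plays of each arm
    sums = [0.0] * k      # total reward collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play every arm once to initialize the indices
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # gap of the arm actually played
    return counts, regret


if __name__ == "__main__":
    plays, reg = ucb1([0.9, 0.1], horizon=2000)
    print(plays, reg)
```

Repeating the run over many seeds and plotting the regret values gives an empirical picture of the fluctuations that the SLLN and CLT describe.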

OC · Apr 25, 2016
A Generalized Fundamental Matrix for Computing Fundamental Quantities of Markov Systems

Li Xia, Peter W. Glynn

As is well known, the fundamental matrix $(I - P + e π)^{-1}$ plays an important role in the performance analysis of Markov systems, where $P$ is the transition probability matrix, $e$ is the column vector of ones, and $π$ is the row vector of the steady-state distribution. It is used to compute the performance potential (relative value function) of Markov decision processes under the average criterion, such as $g=(I - P + e π)^{-1} f$ where $g$ is the column vector of performance potentials and $f$ is the column vector of reward functions. However, we need to pre-compute $π$ before we can compute $(I - P + e π)^{-1}$. In this paper, we derive a generalized version of the fundamental matrix as $(I - P + e r)^{-1}$, where $r$ can be any given row vector satisfying $r e \neq 0$. With this generalized fundamental matrix, we can compute $g=(I - P + e r)^{-1} f$. The steady-state distribution is computed as $π= r(I - P + e r)^{-1}$. The Q-factors at every state-action pair can also be computed in a similar way. These formulas may offer insight into how to efficiently compute or estimate the values of $g$, $π$, and Q-factors in Markov systems, which are fundamental quantities for the performance optimization of Markov systems.
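
The two formulas $π = r(I - P + e r)^{-1}$ and $g = (I - P + e r)^{-1} f$ can be checked numerically on a small chain. The sketch below is an illustration under stated assumptions, not the authors' code: the 3-state matrix `P`, the reward vector `f`, the choice `r = (1, 0, 0)`, and the helper names `solve` and `generalized_fundamental` are all hypothetical, and linear systems are solved by plain Gaussian elimination rather than by forming the inverse.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def generalized_fundamental(P, f, r):
    """Return (pi, g) with pi = r A^{-1} and g = A^{-1} f, A = I - P + e r."""
    n = len(P)
    # Entry (i, j) of e r is r[j], since e is the all-ones column vector.
    A = [[(1.0 if i == j else 0.0) - P[i][j] + r[j] for j in range(n)]
         for i in range(n)]
    A_T = [[A[j][i] for j in range(n)] for i in range(n)]
    pi = solve(A_T, r)   # pi A = r  is equivalent to  A^T pi^T = r^T
    g = solve(A, f)
    return pi, g


if __name__ == "__main__":
    P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
    f = [1.0, 2.0, 4.0]
    r = [1.0, 0.0, 0.0]  # any row vector with r e != 0
    pi, g = generalized_fundamental(P, f, r)
    print(pi, g)
```

For this birth-death chain the stationary distribution is $(0.25, 0.5, 0.25)$, and the computed $g$ satisfies the Poisson equation $(I - P) g = f - η e$ with $η = π f$, normalized by $r g = η$.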

ML · Dec 13, 2022
Minimax Optimal Estimation of Stability Under Distribution Shift

Hongseok Namkoong, Yuanzhe Ma, Peter W. Glynn

The performance of decision policies and prediction models often deteriorates when applied to environments different from the ones seen during training. To ensure reliable operation, we analyze the stability of a system under distribution shift, which is defined as the smallest change in the underlying environment that causes the system's performance to deteriorate beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses that require the specification of a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.

LG · Aug 1, 2024
Online Linear Programming with Batching

Haoran Xu, Peter W. Glynn, Yinyu Ye

We study Online Linear Programming (OLP) with batching. The planning horizon is cut into $K$ batches, and the decisions on customers arriving within a batch can be delayed to the end of their associated batch. Compared with OLP without batching, the ability to delay decisions brings better operational performance, as measured by regret. Two research questions of interest are: (1) What is a lower bound of the regret as a function of $K$? (2) What algorithms can achieve the regret lower bound? These questions have been analyzed in the literature when the distribution of the reward and the resource consumption of the customers have finite support. By contrast, this paper analyzes these questions when the conditional distribution of the reward given the resource consumption is continuous, and we show the answers are different under this setting. When there is only a single type of resource and the decision maker knows the total number of customers, we propose an algorithm with an $O(\log K)$ regret upper bound and provide an $Ω(\log K)$ regret lower bound. We also propose algorithms with an $O(\log K)$ regret upper bound for the setting in which there are multiple types of resource and the setting in which customers arrive following a Poisson process. All these regret upper and lower bounds are independent of the length of the planning horizon, and all the proposed algorithms delay decisions on customers arriving in only the first and the last batch. We also take customer impatience into consideration and establish a way of selecting an appropriate batch size.

LG · Sep 28, 2021
The Fragility of Optimized Bandit Algorithms

Lin Fan, Peter W. Glynn

Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also show a sharp trade-off between the amount of UCB exploration and the heaviness of the resulting regret distribution tail.

OC · Jul 6, 2021
Distributed stochastic optimization with large delays

Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos et al.

One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD), a family of algorithms that result from parallelizing stochastic gradient descent on distributed computing architectures (possibly) asynchronously. However, a key obstacle in the efficient implementation of DASGD is the issue of delays: when a computing node contributes a gradient update, the global model parameter may have already been updated by other nodes several times over, thereby rendering this gradient information stale. These delays can quickly add up if the computational throughput of a node is saturated, so the convergence of DASGD may be compromised in the presence of large delays. Our first contribution is that, by carefully tuning the algorithm's step-size, convergence to the critical set is still achieved in mean square, even if the delays grow unbounded at a polynomial rate. We also establish finer results in a broad class of structured optimization problems (called variationally coherent), where we show that DASGD converges to a global optimum with probability $1$ under the same delay assumptions. Together, these results contribute to the broad landscape of large-scale non-convex stochastic optimization by offering state-of-the-art theoretical guarantees and providing insights for algorithm design.

LG · May 19, 2021
Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Lin Fan, Peter W. Glynn

We study the process-level dynamics of Thompson sampling in the "small gap" regime. The small gap regime is one in which the gaps between the arm means are of order $\sqrt{γ}$ or smaller and the time horizon is of order $1/γ$, where $γ$ is small. As $γ\downarrow 0$, we show that the process-level dynamics of Thompson sampling converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed from first principles using the Continuous Mapping Theorem, can handle stationary, weakly dependent reward processes, and can also be adapted to analyze a variety of sampling-based bandit algorithms. Indeed, we show that the process-level dynamics of many sampling-based bandit algorithms -- including Thompson sampling designed for any single-parameter exponential family of rewards, as well as non-parametric bandit algorithms based on bootstrap re-sampling -- satisfy an invariance principle. Namely, their weak limits coincide with that of Gaussian parametric Thompson sampling with Gaussian priors. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.

OC · Mar 12, 2021
On Incorporating Forecasts into Linear State Space Model Markov Decision Processes

Jacques A. de Chalendar, Peter W. Glynn

Weather forecast information will very likely find increasing application in the control of future energy systems. In this paper, we introduce an augmented state space model formulation with linear dynamics, within which one can incorporate forecast information that is dynamically revealed alongside the evolution of the underlying state variable. We use the martingale model for forecast evolution (MMFE) to enforce the necessary consistency properties that must govern the joint evolution of forecasts with the underlying state. The formulation also generates jointly Markovian dynamics that give rise to Markov decision processes (MDPs) that remain computationally tractable. This paper is the first to enforce MMFE consistency requirements within an MDP formulation that preserves tractability.

LG · Apr 14, 2020
Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou et al.

We study the sequential batch learning problem in linear contextual bandits with finite action sets, where the decision maker is constrained to split incoming individuals into (at most) a fixed number of batches and can only observe outcomes for the individuals within a batch at the batch's end. Compared to both standard online contextual bandit learning and offline policy learning in contextual bandits, this sequential batch learning problem provides a finer-grained formulation of many personalized sequential decision making problems in practical applications, including medical treatment in clinical trials, product recommendation in e-commerce and adaptive experiment design in crowdsourcing. We study two settings of the problem: one where the contexts are arbitrarily generated and the other where the contexts are drawn i.i.d. from some distribution. In each setting, we establish a regret lower bound and provide an algorithm whose regret upper bound nearly matches the lower bound. As an important insight revealed therefrom, in the former setting, we show that the number of batches required to achieve the fully online performance is polynomial in the time horizon, while for the latter setting, a pure-exploitation algorithm with a judicious batch partition scheme achieves the fully online performance even when the number of batches is less than logarithmic in the time horizon. Together, our results provide a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.

OC · Sep 12, 2018
A $c/μ$-Rule for Service Resource Allocation in Group-Server Queues

Li Xia, Zhe George Zhang, Quan-Lin Li et al.

In this paper, we study a dynamic on/off server scheduling problem in a queueing system with multi-class servers, where servers are heterogeneous and can be classified into $K$ groups. Servers in the same group are homogeneous. A scheduling policy determines the number of working servers (servers that are turned on) in each group at every state $n$ (number of customers in the system). Our goal is to find the optimal scheduling policy to minimize the long-run average cost, which consists of an increasing convex holding cost and a linear operating cost. We use the sensitivity-based optimization theory to characterize the optimal policy. A necessary and sufficient condition for the optimal policy is derived. We also prove that the optimal policy has monotone structures and a quasi bang-bang control is optimal. We find that the optimal policy is indexed by the value of $c - μG(n)$, where $c$ is the operating cost rate, $μ$ is the service rate for a server, and $G(n)$ is a computable quantity called the perturbation realization factor. Specifically, the group with smaller negative $c - μG(n)$ has higher priority to be turned on, while the group with positive $c - μG(n)$ should be turned off. However, the preference ranking of each group is affected by $G(n)$ and the preference order may change with the state $n$, the arrival rate, and the cost function. Under a reasonable condition of scale economies, we further prove that the optimal policy obeys a so-called $c$/$μ$-rule. That is, the servers with smaller $c$/$μ$ should be turned on with higher priority and the preference order of groups remains unchanged. This rule can be viewed as a sister version of the famous $cμ$-rule for polling queues. With the monotone property of $G(n)$, we further prove that the optimal policy has a multi-threshold structure when the $c$/$μ$-rule is applied.
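
The priority ordering implied by the $c$/$μ$-rule can be sketched as a simple greedy allocation. This is only an illustration: the number of servers to turn on in a given state is taken as an input `m`, whereas in the paper it follows from the threshold structure, which is not reproduced here. The function name `allocate_by_c_over_mu`, the dictionary keys, and the example groups are hypothetical.

```python
def allocate_by_c_over_mu(groups, m):
    """Turn on m servers, filling groups in increasing order of c / mu.

    groups: list of dicts with keys "c" (operating cost rate),
            "mu" (service rate per server) and "size" (servers in the group).
    m:      total number of servers to turn on (assumed given; the paper
            derives it from state-dependent thresholds).
    Returns the number of working servers per group, in the input order.
    """
    order = sorted(range(len(groups)),
                   key=lambda k: groups[k]["c"] / groups[k]["mu"])
    on = [0] * len(groups)
    remaining = m
    for k in order:
        take = min(groups[k]["size"], remaining)  # fill the cheapest group first
        on[k] = take
        remaining -= take
        if remaining == 0:
            break
    return on


if __name__ == "__main__":
    groups = [
        {"c": 4.0, "mu": 2.0, "size": 3},  # c/mu = 2
        {"c": 3.0, "mu": 3.0, "size": 2},  # c/mu = 1
        {"c": 6.0, "mu": 2.0, "size": 4},  # c/mu = 3
    ]
    print(allocate_by_c_over_mu(groups, 4))  # [2, 2, 0]
```

Because the ordering depends only on the fixed ratios $c$/$μ$, it does not change with the state $n$, which is what distinguishes this rule from the general index $c - μG(n)$.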

PR · Dec 26, 2013
Shape-constrained Estimation of Value Functions

Mohammad Mousavi, Peter W. Glynn

We present a fully nonparametric method to estimate the value function, via simulation, in the context of expected infinite-horizon discounted rewards for Markov chains. Estimating such value functions plays an important role in approximate dynamic programming and applied probability in general. We incorporate "soft information" into the estimation algorithm, such as knowledge of convexity, monotonicity, or Lipschitz constants. In the presence of such information, a nonparametric estimator for the value function can be computed that is provably consistent as the simulated time horizon tends to infinity. As an application, we implement our method on price tolling agreement contracts in energy markets.