SYMay 30, 2016
State Estimation for the Individual and the Population in Mean Field Control with Application to Demand DispatchYue Chen, Ana Bušić, Sean Meyn
This paper concerns state estimation problems in a mean field control setting. In a finite population model, the goal is to estimate the joint distribution of the population state and the state of a typical individual. The observation equations are a noisy measurement of the population. The general results are applied to demand dispatch for regulation of the power grid, based on randomized local control algorithms. In prior work by the authors it has been shown that local control can be carefully designed so that the aggregate of loads behaves as a controllable resource with accuracy matching or exceeding traditional sources of frequency regulation. The operational cost is nearly zero in many cases. The information exchange between grid and load is minimal, but it is assumed in the overall control architecture that the aggregate power consumption of loads is available to the grid operator. It is shown that the Kalman filter can be constructed to reduce these communication requirements,
OCMar 18, 2016
Distributed Randomized Control for Demand DispatchAna Bušić, Sean Meyn
The paper concerns design of control systems for Demand Dispatch to obtain ancillary services to the power grid by harnessing inherent flexibility in many loads. The role of "local intelligence" at the load has been advocated in prior work, randomized local controllers that manifest this intelligence are convenient for loads with a finite number of states. The present work introduces two new design techniques for these randomized controllers: (i) The Individual Perspective Design (IPD) is based on the solution to a one-dimensional family of Markov Decision Processes, whose objective function is formulated from the point of view of a single load. The family of dynamic programming equation appears complex, but it is shown that it is obtained through the solution of a single ordinary differential equation. (ii) The System Perspective Design (SPD) is motivated by a single objective of the grid operator: Passivity of any linearization of the aggregate input-output model. A solution is obtained that can again be computed through the solution of a single ordinary differential equation. Numerical results complement these theoretical results.
OCSep 24, 2014
Individual risk in mean-field control models for decentralized control, with application to automated demand responseYue Chen, Ana Bušić, Sean Meyn
Flexibility of energy consumption can be harnessed for the purposes of ancillary services in a large power grid. In prior work by the authors a randomized control architecture is introduced for individual loads for this purpose. In examples it is shown that the control architecture can be designed so that control of the loads is easy at the grid level: Tracking of a balancing authority reference signal is possible, while ensuring that the quality of service (QoS) for each load is acceptable on average. The analysis was based on a mean field limit (as the number of loads approaches infinity), combined with an LTI-system approximation of the aggregate nonlinear model. This paper examines in depth the issue of individual risk in these systems. The main contributions of the paper are of two kinds: Risk is modeled and quantified: (i) The average performance is not an adequate measure of success. It is found empirically that a histogram of QoS is approximately Gaussian, and consequently each load will eventually receive poor service. (ii) The variance can be estimated from a refinement of the LTI model that includes a white-noise disturbance; variance is a function of the randomized policy, as well as the power spectral density of the reference signal. Additional local control can eliminate risk: (iii) The histogram of QoS is truncated through this local control, so that strict bounds on service quality are guaranteed. (iv) This has insignificant impact on the grid-level performance, beyond a modest reduction in capacity of ancillary service.
SYOct 24, 2019
Demand Dispatch with Heterogeneous Intelligent LoadsJoel Mathias, Ana Bušić, Sean Meyn
A distributed control architecture is presented that is intended to make a collection of heterogeneous loads appear to the grid operator as a nearly perfect battery. Local control is based on randomized decision rules advocated in prior research, and extended in this paper to any load with a discrete number of power states. Additional linear filtering at the load ensures that the input-output dynamics of the aggregate has a nearly flat input-output response: the behavior of an ideal, multi-GW battery system.
OCOct 22, 2016
Ordinary Differential Equation Methods For Markov Decision Processes and Application to Kullback-Leibler Control CostAna Bušić, Sean Meyn
A new approach to computation of optimal policies for MDP (Markov decision process) models is introduced. The main idea is to solve not one, but an entire family of MDPs, parameterized by a weighting factor $ζ$ that appears in the one-step reward function. For an MDP with $d$ states, the family of value functions $\{ h^*_ζ: ζ\in\Re\}$ is the solution to an ODE, $$ \frac{d}{dζ} h^*_ζ= {\cal V}(h^*_ζ) $$ where the vector field ${\cal V}\colon\Re^d\to\Re^d$ has a simple form, based on a matrix inverse. This general methodology is applied to a family of average-cost optimal control models in which the one-step reward function is defined by Kullback-Leibler divergence. The motivation for this reward function in prior work is computation: The solution to the MDP can be expressed in terms of the Perron-Frobenius eigenvector for an associated positive matrix. The drawback with this approach is that no hard constraints on the control are permitted. It is shown here that it is possible to extend this framework to model randomness from nature that cannot be modified by the controller. Perron-Frobenius theory is no longer applicable -- the resulting dynamic programming equations appear as complex as a completely unstructured MDP model. Despite this apparent complexity, it is shown that this class of MDPs admits a solution via this new ODE technique. This approach is new and practical even for the simpler problem in which randomness from nature is absent.
LGJul 5, 2023
Stability of Q-Learning Through Design and OptimismSean Meyn
Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. The purpose of this paper is in part a tutorial on stochastic approximation and Q-learning, providing details regarding the INFORMS APS inaugural Applied Probability Trust Plenary Lecture, presented in Nancy France, June 2023. The paper also presents new approaches to ensure stability and potentially accelerated convergence for these algorithms, and stochastic approximation in other settings. Two contributions are entirely new: 1. Stability of Q-learning with linear function approximation has been an open topic for research for over three decades. It is shown that with appropriate optimistic training in the form of a modified Gibbs policy, there exists a solution to the projected Bellman equation, and the algorithm is stable (in terms of bounded parameter estimates). Convergence remains one of many open topics for research. 2. The new Zap Zero algorithm is designed to approximate the Newton-Raphson flow without matrix inversion. It is stable and convergent under mild assumptions on the mean flow vector field for the algorithm, and compatible statistical assumption on an underlying Markov chain. The algorithm is a general approach to stochastic approximation which in particular applies to Q-learning with "oblivious" training even with non-linear function approximation.
OCSep 10, 2023
Convex Q Learning in a Stochastic Environment: Extended VersionFan Lu, Sean Meyn
The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The main contributions firstly concern properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, and a significant relationship between the solution to the new convex program, and the solution to standard Q-learning. The second set of contributions concern algorithm design and analysis: (i) A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a mean-square sense; (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering ``relative'' dynamic programming equations; (iv) The theory is illustrated with an application to a classical inventory control problem.
SYSep 24, 2014
Passive Dynamics in Mean Field ControlAna Bušić, Sean Meyn
Mean-field models are a popular tool in a variety of fields. They provide an understanding of the impact of interactions among a large number of particles or people or other "self-interested agents", and are an increasingly popular tool in distributed control. This paper considers a particular randomized distributed control architecture introduced in our own recent work. In numerical results it was found that the associated mean-field model had attractive properties for purposes of control. In particular, when viewed as an input-output system, its linearization was found to be minimum phase. In this paper we take a closer look at the control model. The results are summarized as follows: (i) The Markov Decision Process framework of Todorov is extended to continuous time models, in which the "control cost" is based on relative entropy. This is the basis of the construction of a family of controlled Markovian generators. (ii) A decentralized control architecture is proposed in which each agent evolves as a controlled Markov process. A central authority broadcasts a common control signal to each agent. The central authority chooses this signal based on an aggregate scalar output of the Markovian agents. (iii) Provided the control-free system is a reversible Markov process, the following identity holds for the linearization, \[ \text{Real} (G(jω)) = \text{PSD}_Y(ω)\ge 0, \quad ω\in\Re, \] where the right hand side denotes the power spectral density for the output of any one of the individual (control-free) Markov processes.
OCOct 14, 2022
Model-Free Characterizations of the Hamilton-Jacobi-Bellman Equation and Convex Q-Learning in Continuous TimeFan Lu, Joel Mathias, Sean Meyn et al.
Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer theory for convergence, and the possibility of making use of greater a priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous time domain, with finite-horizon optimal control objective. The main contributions are (i) Algorithm design is based on a new Q-ODE, which defines the model-free characterization of the Hamilton-Jacobi-Bellman equation. (ii) The Q-ODE motivates a new formulation of Convex Q-learning that avoids the approximations appearing in prior work. The Bellman error used in the algorithm is defined by filtered measurements, which is beneficial in the presence of measurement noise. (iii) A characterization of boundedness of the constraint region is obtained through a non-trivial extension of recent results from the discrete time setting. (iv) The theory is illustrated in application to resource allocation for distributed energy resources, for which the theory is ideally suited.
OCOct 17, 2022
Sufficient Exploration for Convex Q-learningFan Lu, Prashant Mehta, Sean Meyn et al.
In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.
30.7LGMar 29
Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended VersionMasoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn
Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
LGFeb 5
Optimistic Training and Convergence of Q-Learning -- Extended VersionPrashant Mehta, Sean Meyn
In recent work it is shown that Q-learning with linear function approximation is stable, in the sense of bounded parameter estimates, under the $(\varepsilon,κ)$-tamed Gibbs policy; $κ$ is inverse temperature, and $\varepsilon>0$ is introduced for additional exploration. Under these assumptions it also follows that there is a solution to the projected Bellman equation (PBE). Left open is uniqueness of the solution, and criteria for convergence outside of the standard tabular or linear MDP settings. The present work extends these results to other variants of Q-learning, and clarifies prior work: a one dimensional example shows that under an oblivious policy for training there may be no solution to the PBE, or multiple solutions, and in each case the algorithm is not stable under oblivious training. The main contribution is that far more structure is required for convergence. An example is presented for which the basis is ideal, in the sense that the true Q-function is in the span of the basis. However, there are two solutions to the PBE under the greedy policy, and hence also for the $(\varepsilon,κ)$-tamed Gibbs policy for all sufficiently small $\varepsilon>0$ and $κ\ge 1$.
STOct 27, 2021
The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement LearningVivek Borkar, Shuhang Chen, Adithya Devraj et al.
The paper concerns the $d$-dimensional stochastic approximation recursion, $$ θ_{n+1}= θ_n + α_{n + 1} f(θ_n, Φ_{n+1}) $$ where $ \{ Φ_n \}$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3): (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\textsf{E}[ z_n z_n^T ]$ to the asymptotic covariance in the CLT, where $z_n =: (θ_n-θ^*)/\sqrt{α_n}$. (iii) The CLT holds for the normalized version $z^{\text{PR}}_n =: \sqrt{n} [θ^{\text{PR}}_n -θ^*]$, of the averaged parameters $θ^{\text{PR}}_n =:n^{-1} \sum_{k=1}^nθ_k$, subject to standard assumptions on the step-size. Moreover, the covariance in the CLT coincides with the minimal covariance of Polyak and Ruppert. (iv) An example is given where $f$ and $\bar{f}$ are linear in $θ$, and $Φ$ is a geometrically ergodic Markov chain but does not satisfy (DV3). While the algorithm is convergent, the second moment of $θ_n$ is unbounded and in fact diverges. This arXiv version represents a major extension of the results in prior versions.The main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning.
OCSep 30, 2020
Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic ApproximationShuhang Chen, Adithya Devraj, Andrey Bernstein et al.
The ODE method has been a workhorse for algorithm design and analysis since the introduction of the stochastic approximation. It is now understood that convergence theory amounts to establishing robustness of Euler approximations for ODEs, while theory of rates of convergence requires finer analysis. This paper sets out to extend this theory to quasi-stochastic approximation, based on algorithms in which the "noise" is based on deterministic signals. The main results are obtained under minimal assumptions: the usual Lipschitz conditions for ODE vector fields, and it is assumed that there is a well defined linearization near the optimal parameter $θ^*$, with Hurwitz linearization matrix $A^*$. The main contributions are summarized as follows: (i) If the algorithm gain is $a_t=g/(1+t)^ρ$ with $g>0$ and $ρ\in(0,1)$, then the rate of convergence of the algorithm is $1/t^ρ$. There is also a well defined "finite-$t$" approximation: \[ a_t^{-1}\{Θ_t-θ^*\}=\bar{Y}+Ξ^{\mathrm{I}}_t+o(1) \] where $\bar{Y}\in\mathbb{R}^d$ is a vector identified in the paper, and $\{Ξ^{\mathrm{I}}_t\}$ is bounded with zero temporal mean. (ii) With gain $a_t = g/(1+t)$ the results are not as sharp: the rate of convergence $1/t$ holds only if $I + g A^*$ is Hurwitz. (iii) Based on the Ruppert-Polyak averaging of stochastic approximation, one would expect that a convergence rate of $1/t$ can be obtained by averaging: \[ Θ^{\text{RP}}_T=\frac{1}{T}\int_{0}^T Θ_t\,dt \] where the estimates $\{Θ_t\}$ are obtained using the gain in (i). The preceding sharp bounds imply that averaging results in $1/t$ convergence rate if and only if $\bar{Y}=\sf 0$. This condition holds if the noise is additive, but appears to fail in general. (iv) The theory is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
PRFeb 7, 2020
Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic ApproximationShuhang Chen, Adithya M. Devraj, Ana Bušić et al.
This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This is motivation for the focus on mean square error bounds for parameter estimates. It is shown that mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design.
OCSep 17, 2018
Optimal Matrix Momentum Stochastic Approximation and Applications to Q-learningAdithya M. Devraj, Ana Bušić, Sean Meyn
Acceleration is an increasingly common theme in the stochastic optimization literature. The two most common examples are Nesterov's method, and Polyak's momentum technique. In this paper two new algorithms are introduced for root finding problems: 1) PolSA is a root finding algorithm with specially designed matrix momentum, and 2) NeSA can be regarded as a variant of Nesterov's algorithm, or a simplification of PolSA. The PolSA algorithm is new even in the context of optimization (when cast as a root finding problem). The research surveyed in this paper is motivated by applications to reinforcement learning. It is well known that most variants of TD- and Q-learning may be cast as SA (stochastic approximation) algorithms, and the tools from general SA theory can be used to investigate convergence and bounds on convergence rate. In particular, the asymptotic variance is a common metric of performance for SA algorithms, and is also one among many metrics used in assessing the performance of stochastic optimization algorithms. There are two well known SA techniques that are known to have optimal asymptotic variance: the Ruppert-Polyak averaging technique, and stochastic Newton-Raphson (SNR). The former algorithm can have extremely bad transient performance, and the latter can be computationally expensive. It is demonstrated here that parameter estimates from the new PolSA algorithm couple with those of the ideal (but more complex) SNR algorithm. The new algorithm is thus a third approach to obtain optimal asymptotic covariance. These strong results require assumptions on the model. A linearized model is considered, and the noise is assumed to be a martingale difference sequence. Numerical results are obtained in a non-linear setting that is the motivation for this work: In PolSA implementations of Q-learning it is observed that coupling occurs with SNR in this non-ideal setting.
SYAug 31, 2016
Estimation and Control of Quality of Service in Demand DispatchYue Chen, Ana Bušić, Sean Meyn
It is now well known that flexibility of energy consumption can be harnessed for the purposes of grid-level ancillary services. In particular, through distributed control of a collection of loads, a balancing authority regulation signal can be tracked accurately, while ensuring that the quality of service (QoS) for each load is acceptable {\it on average}. In this paper it is argued that a histogram of QoS is approximately Gaussian, and consequently each load will eventually receive poor service. Statistical techniques are developed to estimate the mean and variance of QoS as a function of the power spectral density of the regulation signal. It is also shown that additional local control can eliminate risk: The histogram of QoS is {\it truncated} through this local control, so that strict bounds on service quality are guaranteed. While there is a tradeoff between the grid-level tracking performance (capacity and accuracy) and the bounds imposed on QoS, it is found that the loss of capacity is minor in typical cases.
SYSep 4, 2015
Smart Fridge / Dumb Grid? Demand Dispatch for the Power Grid of 2020Joel Mathias, Rim Kaddah, Ana Bušić et al.
In discussions at the 2015 HICSS meeting, it was argued that loads can provide most of the ancillary services required today and in the future. Through load-level and grid-level control design, high-quality ancillary service for the grid is obtained without impacting quality of service delivered to the consumer. This approach to grid regulation is called demand dispatch: loads are providing service continuously and automatically, without consumer interference. In this paper we ask, what intelligence is required at the grid-level? In particular, does the grid-operator require more than one-way communication to the loads? Our main conclusion: risk is not great in lower frequency ranges, e.g., PJM's RegA or BPA's balancing reserves. In particular, ancillary services from refrigerators and pool-pumps can be obtained successfully with only one-way communication. This requires intelligence at the loads, and much less intelligence at the grid level.
LGJul 6, 2013
Approximate dynamic programming using fluid and diffusion approximations with applications to power managementWei Chen, Dayu Huang, Ankur A. Kulkarni et al.
Neuro-dynamic programming is a class of powerful techniques for approximating the solution to dynamic programming equations. In their most computationally attractive formulations, these techniques provide the approximate solution only within a prescribed finite-dimensional function class. Thus, the question that always arises is how should the function class be chosen? The goal of this paper is to propose an approach using the solutions to associated fluid and diffusion approximations. In order to illustrate this approach, the paper focuses on an application to dynamic speed scaling for power management in computer processors.