NAFeb 26, 2013
Feedback Particle FilterTao Yang, Prashant G. Mehta, Sean P. Meyn
A new formulation of the particle filter for nonlinear filtering is presented, based on concepts from optimal control, and from the mean-field game theory. The optimal control is chosen so that the posterior distribution of a particle matches as closely as possible the posterior distribution of the true state given the observations. This is achieved by introducing a cost function, defined by the Kullback-Leibler (K-L) divergence between the actual posterior, and the posterior of any particle. The optimal control input is characterized by a certain Euler-Lagrange (E-L) equation, and is shown to admit an innovation error-based feedback structure. For diffusions with continuous observations, the value of the optimal control solution is ideal. The two posteriors match exactly, provided they are initialized with identical priors. The feedback particle filter is defined by a family of stochastic systems, each evolving under this optimal control law. A numerical algorithm is introduced and implemented in two general examples, and a neuroscience application involving coupled oscillators. Some preliminary numerical comparisons between the feed- back particle filter and the bootstrap particle filter are described.
NAMar 5, 2013
Multivariable Feedback Particle FilterTao Yang, Richard S. Laugesen, Prashant G. Mehta et al.
In recent work it is shown that importance sampling can be avoided in the particle filter through an innovation structure inspired by traditional nonlinear filtering combined with Mean-Field Game formalisms. The resulting feedback particle filter (FPF) offers significant variance improvements; in particular, the algorithm can be applied to systems that are not stable. The filter comes with an up-front computational cost to obtain the filter gain. This paper describes new representations and algorithms to compute the gain in the general multivariable setting. The main contributions are, (i) Theory surrounding the FPF is improved: Consistency is established in the multivariate setting, as well as well-posedness of the associated PDE to obtain the filter gain. (ii) The gain can be expressed as the gradient of a function, which is precisely the solution to Poisson's equation for a related MCMC diffusion (the Smoluchowski equation). This provides a bridge to MCMC as well as to approximate optimal filtering approaches such as TD-learning, which can in turn be used to approximate the gain. (iii) Motivated by a weak formulation of Poisson's equation, a Galerkin finite-element algorithm is proposed for approximation of the gain. Its performance is illustrated in numerical experiments.
OCMay 17, 2012
Random-Time, State-Dependent Stochastic Drift for Markov Chains and Application to Stochastic Stabilization Over Erasure ChannelsSerdar Yüksel, Sean P. Meyn
It is known that state-dependent, multi-step Lyapunov bounds lead to greatly simplified verification theorems for stability for large classes of Markov chain models. This is one component of the "fluid model" approach to stability of stochastic networks. In this paper we extend the general theory to randomized multi-step Lyapunov theory to obtain criteria for stability and steady-state performance bounds, such as finite moments. These results are applied to a remote stabilization problem, in which a controller receives measurements from an erasure channel with limited capacity. Based on the general results in the paper it is shown that stability of the closed loop system is assured provided that the channel capacity is greater than the logarithm of the unstable eigenvalue, plus an additional correction term. The existence of a finite second moment in steady-state is established under additional conditions.
OCSep 30, 2019
Diffusion map-based algorithm for Gain function approximation in the Feedback Particle FilterAmirhossein Taghvaei, Prashant G. Mehta, Sean P. Meyn
Feedback particle filter (FPF) is a numerical algorithm to approximate the solution of the nonlinear filtering problem in continuous-time settings. In any numerical implementation of the FPF algorithm, the main challenge is to numerically approximate the so-called gain function. A numerical algorithm for gain function approximation is the subject of this paper. The exact gain function is the solution of a Poisson equation involving a probability-weighted Laplacian $Δ_ρ$. The numerical problem is to approximate this solution using {\em only} finitely many particles sampled from the probability distribution $ρ$. A diffusion map-based algorithm was proposed by the authors in a prior work to solve this problem. The algorithm is named as such because it involves, as an intermediate step, a diffusion map approximation of the exact semigroup $e^{Δ_ρ}$. The original contribution of this paper is to carry out a rigorous error analysis of the diffusion map-based algorithm. The error is shown to include two components: bias and variance. The bias results from the diffusion map approximation of the exact semigroup. The variance arises because of finite sample size. Scalings and upper bounds are derived for bias and variance. These bounds are then illustrated with numerical experiments that serve to emphasize the effects of problem dimension and sample size. The proposed algorithm is applied to two filtering examples and comparisons provided with the sequential importance resampling (SIR) particle filter.
NADec 16, 2016
Error Estimates for the Kernel Gain Function Approximation in the Feedback Particle FilterAmirhossein Taghvaei, Prashant G. Mehta, Sean P. Meyn
This paper is concerned with the analysis of the kernel-based algorithm for gain function approximation in the feedback particle filter. The exact gain function is the solution of a Poisson equation involving a probability-weighted Laplacian. The kernel-based method -- introduced in our prior work -- allows one to approximate this solution using {\em only} particles sampled from the probability distribution. This paper describes new representations and algorithms based on the kernel-based method. Theory surrounding the approximation is improved and a novel formula for the gain function approximation is derived. A procedure for carrying out error analysis of the approximation is introduced. Certain asymptotic estimates for bias and variance are derived for the general nonlinear non-Gaussian case. Comparison with the constant gain function approximation is provided. The results are illustrated with the aid of some numerical experiments.
OCAug 8, 2020
Convex Q-Learning, Part 1: Deterministic Optimal ControlPrashant G. Mehta, Sean P. Meyn
It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) The new class of convex Q-learning algorithms is introduced based on the convex relaxation of the Bellman equation. Convergence is established under general conditions, including a linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that in fact the algorithms are very different: while convex Q-learning solves a convex program that approximates the Bellman equation, theory for DQN is no stronger than for Watkins' algorithm with function approximation: (a) it is shown that both seek solutions to the same fixed point equation, and (b) the ODE approximations for the two algorithms coincide, and little is known about the stability of this ODE. These results are obtained for deterministic nonlinear systems with total cost criterion. Many extensions are proposed, including kernel implementation, and extension to MDP models.
LGFeb 24, 2020
Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast LearningAdithya M. Devraj, Sean P. Meyn
Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in $1/(1-γ)$, where $γ< 1$ is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an $\varepsilon$-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all $γ< 1$. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an $\varepsilon$-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of $1/(1-γ)$; an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is a quadratic in $1/(1- ρ^* γ)$, where $1 - ρ^* > 0$ is an upper bound on the spectral gap of an optimal transition matrix.
LGOct 11, 2019
Zap Q-Learning With Nonlinear Function ApproximationShuhang Chen, Adithya M. Devraj, Fan Lu et al.
Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to choice of function approximation architecture.
SYApr 25, 2019
Zap Q-Learning for Optimal Stopping Time ProblemsShuhang Chen, Adithya M. Devraj, Ana Bušić et al.
The objective in this paper is to obtain fast converging reinforcement learning algorithms to approximate solutions to the problem of discounted cost optimal stopping in an irreducible, uniformly ergodic Markov chain, evolving on a compact subset of $\mathbb{R}^n$. We build on the dynamic programming approach taken by Tsitsikilis and Van Roy, wherein they propose a Q-learning algorithm to estimate the optimal state-action value function, which then defines an optimal stopping rule. We provide insights as to why the convergence rate of this algorithm can be slow, and propose a fast-converging alternative, the "Zap-Q-learning" algorithm, designed to achieve optimal rate of convergence. For the first time, we prove the convergence of the Zap-Q-learning algorithm under the assumption of linear function approximation setting. We use ODE analysis for the proof, and the optimal asymptotic variance property of the algorithm is reflected via fast convergence in a finance example.
LGDec 28, 2018
Differential Temporal Difference LearningAdithya M. Devraj, Ioannis Kontoyiannis, Sean P. Meyn
Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, are an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: Their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.
SYJul 12, 2017
Fastest Convergence for Q-learningAdithya M. Devraj, Sean P. Meyn
The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases. A secondary goal of this paper is tutorial. The first half of the paper contains a survey on reinforcement learning algorithms, with a focus on minimum variance algorithms.
SYApr 6, 2016
Differential TD Learning for Value Function ApproximationAdithya M. Devraj, Sean P. Meyn
Value functions arise as a component of algorithms as well as performance metrics in statistics and engineering applications. Computation of the associated Bellman equations is numerically challenging in all but a few special cases. A popular approximation technique is known as Temporal Difference (TD) learning. The algorithm introduced in this paper is intended to resolve two well-known problems with this approach: In the discounted-cost setting, the variance of the algorithm diverges as the discount factor approaches unity. Second, for the average cost setting, unbiased algorithms exist only in special cases. It is shown that the gradient of any of these value functions admits a representation that lends itself to algorithm design. Based on this result, the new differential TD method is obtained for Markovian models on Euclidean space with smooth dynamics. Numerical examples show remarkable improvements in performance. In application to speed scaling, variance is reduced by two orders of magnitude.