Thinh T. Doan

OC
h-index 23
26 papers
519 citations
Novelty 55%
AI Score 49

26 Papers

OC May 27, 2022
Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games

Sihan Zeng, Thinh T. Doan, Justin Romberg

We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization program, a natural approach to solve the problem is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understanding of this method is limited. In our paper, we consider solving an entropy-regularized variant of the Markov game. The regularization introduces structure into the optimization landscape that makes the solutions more identifiable and allows the problem to be solved more efficiently. Our main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem. We explicitly characterize the finite-time performance of the last iterate of our algorithm, which vastly improves over the existing convergence bound of the gradient descent ascent algorithm without regularization. Finally, we complement the analysis with numerical simulations that illustrate the accelerated convergence of the algorithm.
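
As a rough illustration of the idea in this abstract, the sketch below runs simultaneous gradient descent ascent on an entropy-regularized matrix game, which is the single-state special case of a zero-sum Markov game. The softmax parameterization, the function name `regularized_gda`, the regularization weight `tau`, the step size `eta`, and the matching-pennies payoff matrix are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def regularized_gda(A, tau=0.5, eta=0.1, steps=5000):
    """Simultaneous GDA on
        f(x, y) = x^T A y + tau * sum(x log x) - tau * sum(y log y),
    where x = softmax(theta) minimizes and y = softmax(phi) maximizes.
    All hyperparameters are illustrative assumptions.
    """
    m, n = A.shape
    theta = np.linspace(1.0, 0.0, m)  # arbitrary non-uniform starting logits
    phi = np.linspace(0.0, 1.0, n)
    for _ in range(steps):
        x, y = softmax(theta), softmax(phi)
        g = A @ y + tau * (np.log(x) + 1.0)    # df/dx
        h = A.T @ x - tau * (np.log(y) + 1.0)  # df/dy
        # Chain rule through softmax: (diag(p) - p p^T) v = p * (v - p.v)
        theta = theta - eta * x * (g - x @ g)  # descent step for the min player
        phi = phi + eta * y * (h - y @ h)      # ascent step for the max player
    return softmax(theta), softmax(phi)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies
x, y = regularized_gda(A)
print(x, y)  # both strategies should approach the uniform distribution
```

For matching pennies the regularized and unregularized equilibria coincide at the uniform strategy by symmetry, so this toy case only shows the stabilizing effect of the entropy term, not the paper's argument about the choice of the regularization parameter.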

LG Mar 23, 2023
Connected Superlevel Set in (Deep) Reinforcement Learning and its Application to Minimax Theorems

Sihan Zeng, Thinh T. Doan, Justin Romberg

The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property. To the best of our knowledge, these are novel and previously unknown discoveries. We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and is equiconnected on the other side observes the minimax equality (i.e., has a Nash equilibrium). We find that this exact structure is exhibited by an interesting robust reinforcement learning problem under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result is established in the literature.

MA Jun 15, 2022
Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games

Dingyang Chen, Qi Zhang, Thinh T. Doan

We study the performance of policy gradient methods for the subclass of Markov games known as Markov potential games (MPGs), which extends the notion of normal-form potential games to the stateful setting and includes the important special case of the fully cooperative setting where the agents share an identical reward function. Our focus in this paper is to study the convergence of the policy gradient method for solving MPGs under softmax policy parameterization, both tabular and parameterized with general function approximators such as neural networks. We first show the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies. Second, we derive the finite-time performance of the policy gradient in two settings: 1) using the log-barrier regularization, and 2) using the natural policy gradient under the best-response dynamics (NPG-BR). Finally, extending the notion of price of anarchy (POA) and smoothness in normal-form games, we introduce the POA for MPGs and provide a POA bound for NPG-BR. To our knowledge, this is the first POA bound for solving MPGs. To support our theoretical results, we empirically compare the convergence rates and POA of policy gradient variants for both tabular and neural softmax policies.

MA Apr 1
Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games

Wonseok Yang, Thinh T. Doan

This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.

ML Jul 27, 2024
Bayesian meta learning for trustworthy uncertainty quantification

Zhenyuan Yuan, Thinh T. Doan

We consider the problem of Bayesian regression with trustworthy uncertainty quantification. We define uncertainty quantification to be trustworthy if the ground truth can be captured by intervals dependent on the predictive distributions with a pre-specified probability. Furthermore, we propose Trust-Bayes, a novel optimization framework for Bayesian meta learning which is cognizant of trustworthy uncertainty quantification without explicit assumptions on the prior model/distribution of the functions. We characterize the lower bounds of the probabilities of the ground truth being captured by the specified intervals and analyze the sample complexity with respect to the feasible probability for trustworthy uncertainty quantification. Monte Carlo simulation of a case study using Gaussian process regression is conducted for verification and comparison with the Meta-prior algorithm.

SY Apr 16
CBF-based Probabilistic Safe Navigation under Unknown Nonlinear Obstacle Dynamics

Jiwon Lee, Hugo Matias, Daniel Silvestre et al.

Safe navigation for an ego vehicle in uncertain environments characterized by dynamic obstacles with unknown nonlinear dynamics is a challenging problem of significant practical interest. Existing approaches in the literature either lack formal safety guarantees, require full model knowledge, or fail to account for the risk associated with the vehicle's exact body geometry and the temporal evolution of uncertainty between sampling instants. In this paper, we propose a data-driven observer for the unknown obstacle dynamics that generates an alpha-confidence set flow, which is exactly transformed into a Control Barrier Function (CBF) to enforce (1-alpha)-probability safety. The proposed framework accommodates nonlinear ego vehicle dynamics of arbitrary relative degree, as demonstrated through case studies involving first- and second-order dynamics of an unmanned surface vehicle.

OC Jan 23, 2024
Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity

Thinh T. Doan

This paper proposes a new variant of the two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators, assuming only noisy samples of these operators can be observed. Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples. The estimated values of these averaging steps will then be used in the two-time-scale stochastic approximation updates to find the desired solution. Our main theoretical result is to show that, under the strongly monotone condition of the underlying nonlinear operators, the mean-squared errors of the iterates generated by the proposed method converge to zero at an optimal rate $O(1/k)$, where $k$ is the number of iterations. Our result significantly improves the existing result of two-time-scale stochastic approximation, where the best known finite-time convergence rate is $O(1/k^{2/3})$. We illustrate this result by applying the proposed method to develop new reinforcement learning algorithms with improved performance.

OC May 15, 2024
Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning

Sihan Zeng, Thinh T. Doan

Two-time-scale optimization is a framework introduced in Zeng et al. (2024) that abstracts a range of policy evaluation and policy optimization problems in reinforcement learning (RL). Akin to bi-level optimization under a particular type of stochastic oracle, the two-time-scale optimization framework has an upper level objective whose gradient evaluation depends on the solution of a lower level problem, which is to find the root of a strongly monotone operator. In this work, we propose a new method for solving two-time-scale optimization that achieves significantly faster convergence than prior art. The key idea of our approach is to leverage an averaging step to improve the estimates of the operators in both lower and upper levels before using them to update the decision variables. These additional averaging steps eliminate the direct coupling between the main variables, enabling the accelerated performance of our algorithm. We characterize the finite-time convergence rates of the proposed algorithm under various conditions of the underlying objective function, including strong convexity, Polyak-Lojasiewicz condition, and general non-convexity. These rates significantly improve over the best-known complexity of the standard two-time-scale stochastic approximation algorithm. When applied to RL, we show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art. Finally, we support our theoretical results with numerical simulations in RL.

LG Feb 14, 2025
Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation

Seo Taek Kong, Sihan Zeng, Thinh T. Doan et al.

We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first non-asymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior works.

LG Mar 3, 2025
Accelerating Multi-Task Temporal Difference Learning under Low-Rank Representation

Yitao Bai, Sihan Zeng, Justin Romberg et al.

We study policy evaluation problems in multi-task reinforcement learning (RL) under a low-rank representation setting. In this setting, we are given $N$ learning tasks where the corresponding value functions of these tasks lie in an $r$-dimensional subspace, with $r<N$. One can apply the classic temporal-difference (TD) learning method for solving these problems, where this method learns the value function of each task independently. In this paper, we are interested in understanding whether one can exploit the low-rank structure of the multi-task setting to accelerate the performance of TD learning. To answer this question, we propose a new variant of the TD learning method, where we integrate the so-called truncated singular value decomposition step into the update of TD learning. This additional step enables TD learning to exploit the dominant directions due to the low-rank structure to update the iterates, thereby improving its performance. Our empirical results show that the proposed method significantly outperforms classic TD learning, where the performance gap increases as the rank $r$ decreases. From the theoretical point of view, introducing the truncated singular value decomposition step into TD learning might cause instability in the updates. We provide a theoretical result showing that the instability does not happen. Specifically, we prove that the proposed method converges at a rate $\mathcal{O}(\frac{\ln(t)}{t})$, where $t$ is the number of iterations. This rate matches that of the standard TD learning.
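
The truncated singular value decomposition step mentioned in this abstract can be sketched as a rank-$r$ projection of a matrix whose columns hold per-task estimates. The helper name `truncate_rank`, the matrix shapes, and the noise level are assumptions for illustration; how the paper interleaves this step with the TD update is not specified here and is not reproduced.

```python
import numpy as np

def truncate_rank(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the r largest singular values and their singular vectors.
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
# Hypothetical setup: 6 states, N = 4 tasks, value functions in an r = 2 subspace.
low_rank = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))
noisy = low_rank + 0.01 * rng.normal(size=low_rank.shape)  # stand-in for noisy iterates
cleaned = truncate_rank(noisy, r=2)

print(np.linalg.norm(noisy - low_rank), np.linalg.norm(cleaned - low_rank))
```

The projection discards the directions with the smallest singular values, which is the sense in which the iterates are restricted to the dominant directions.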

SY Mar 31
Finite-Time Analysis of Projected Two-Time-Scale Stochastic Approximation

Yitao Bai, Thinh T. Doan, Justin Romberg

We study the finite-time convergence of projected linear two-time-scale stochastic approximation with constant step sizes and Polyak--Ruppert averaging. We establish an explicit mean-square error bound, decomposing it into two interpretable components, an approximation error determined by the constrained subspace and a statistical error decaying at a sublinear rate, with constants expressed through restricted stability margins and a coupling invertibility condition. These constants cleanly separate the effect of subspace choice (approximation errors) from the effect of the averaging horizon (statistical errors). We illustrate our theoretical results through a number of numerical experiments on both synthetic and reinforcement learning problems.

OC May 3, 2024
Natural Policy Gradient and Actor Critic Methods for Constrained Multi-Task Reinforcement Learning

Sihan Zeng, Thinh T. Doan, Justin Romberg

Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks at the same time. This paper presents a constrained formulation for multi-task RL where the goal is to maximize the average performance of the policy across tasks subject to bounds on the performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each given one task and observing local information, cooperate to find the solution of the globally constrained objective using local communication. We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sampled-based actor-critic algorithm that finds the optimal policy using online samples of state, action, and reward. Finally, we study the extension of the algorithm to the linear function approximation setting.

OC Dec 17, 2021
Convergence Rates of Two-Time-Scale Gradient Descent-Ascent Dynamics for Solving Nonconvex Min-Max Problems

Thinh T. Doan

There has been much recent interest in solving nonconvex min-max optimization problems due to their broad applications in many areas including machine learning, networked resource allocations, and distributed optimization. Perhaps the most popular first-order method for solving min-max optimization is the so-called simultaneous (or single-loop) gradient descent-ascent algorithm due to its simplicity in implementation. However, theoretical guarantees on the convergence of this algorithm are very sparse since it can diverge even in a simple bilinear problem. In this paper, our focus is to characterize the finite-time performance (or convergence rates) of the continuous-time variant of the simultaneous gradient descent-ascent algorithm. In particular, we derive the rates of convergence of this method under a number of different conditions on the underlying objective function, namely, two-sided Polyak-Łojasiewicz (PL), one-sided PL, nonconvex-strongly concave, and strongly convex-nonconcave conditions. Our convergence results improve the ones in prior works under the same conditions of objective functions. The key idea in our analysis is to use the classic singular perturbation theory and coupling Lyapunov functions to address the time-scale difference and interactions between the gradient descent and ascent dynamics. Our results on the behavior of the continuous-time algorithm may be used to enhance the convergence properties of its discrete-time counterpart.
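
To make the two-time-scale gradient descent-ascent dynamics concrete, here is a minimal sketch that integrates them with forward Euler on a toy strongly-convex-strongly-concave quadratic. The objective $f(x, y) = \tfrac{1}{2}x^2 + 2xy - \tfrac{1}{2}y^2$, the rates `alpha` and `beta`, and the integration step `dt` are assumptions chosen for this example only.

```python
def gda_flow(x0, y0, alpha=1.0, beta=5.0, dt=0.01, steps=2000):
    """Forward-Euler integration of
        dx/dt = -alpha * df/dx,   dy/dt = +beta * df/dy
    for f(x, y) = 0.5*x**2 + 2*x*y - 0.5*y**2 (illustrative objective).
    beta > alpha makes the ascent variable y the faster time scale.
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = -alpha * (x + 2.0 * y)  # -alpha * df/dx
        dy = beta * (2.0 * x - y)    # +beta * df/dy
        x, y = x + dt * dx, y + dt * dy
    return x, y

x_T, y_T = gda_flow(1.0, 1.0)
print(x_T, y_T)  # both should be very close to the saddle point (0, 0)
```

This toy objective is far simpler than the nonconvex settings the abstract covers; it only shows how the two rates enter the dynamics.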

OC Oct 21, 2021
Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes

Sihan Zeng, Thinh T. Doan, Justin Romberg

We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes) and they are all intertwined with each other. Our main contribution is to derive a finite-time analysis for the convergence of this algorithm to the global optimum of a CMDP problem. Specifically, we show that with a proper choice of step sizes the optimality gap and constraint violation converge to zero in expectation at a rate $\mathcal{O}(1/K^{1/6})$, where K is the number of iterations. To our knowledge, this paper is the first to study the finite-time complexity of an online primal-dual actor-critic method for solving a CMDP problem. We also validate the effectiveness of this algorithm through numerical simulations.

OC Sep 29, 2021
A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning

Sihan Zeng, Thinh T. Doan, Justin Romberg

We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying MDPs controlled by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" than the latter. Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, PL condition, and general non-convexity. We apply our framework to various policy optimization problems. First, we look at the infinite-horizon average-reward MDP with finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$ for the online actor-critic algorithm under function approximation, which recovers the best known rate derived specifically for this problem. Second, we study the linear-quadratic regulator and show that an online actor-critic method converges with rate $O(k^{-2/3})$. Third, we use the actor-critic algorithm to solve the policy optimization problem in an entropy regularized Markov decision process, where we also establish a convergence rate of $O(k^{-2/3})$. The results we derive for both the second and third problems are novel and previously unknown in the literature. Finally, we briefly present the application of our framework to gradient-based policy evaluation algorithms in reinforcement learning.

DC Aug 26, 2021
Byzantine Fault-Tolerance in Federated Local SGD under 2f-Redundancy

Nirupam Gupta, Thinh T. Doan, Nitin Vaidya

We consider the problem of Byzantine fault-tolerance in federated machine learning. In this problem, the system comprises multiple agents each with local data, and a trusted centralized coordinator. In the fault-free setting, the agents collaborate with the coordinator to find a minimizer of the aggregate of their local cost functions defined over their local data. We consider a scenario where some agents ($f$ out of $N$) are Byzantine faulty. Such agents need not follow a prescribed algorithm correctly, and may communicate arbitrary incorrect information to the coordinator. In the presence of Byzantine agents, a more reasonable goal for the non-faulty agents is to find a minimizer of the aggregate cost function of only the non-faulty agents. This particular goal is commonly referred to as exact fault-tolerance. Recent work has shown that exact fault-tolerance is achievable if and only if the non-faulty agents satisfy the property of $2f$-redundancy. Now, under this property, techniques are known to impart exact fault-tolerance to the distributed implementation of the classical stochastic gradient-descent (SGD) algorithm. However, we do not know of any such techniques for the federated local SGD algorithm - a more commonly used method for federated machine learning. To address this issue, we propose a novel technique named comparative elimination (CE). We show that, under $2f$-redundancy, the federated local SGD algorithm with CE can indeed obtain exact fault-tolerance in the deterministic setting when the non-faulty agents can accurately compute gradients of their local cost functions. In the general stochastic case, when agents can only compute unbiased noisy estimates of their local gradients, our algorithm achieves approximate fault-tolerance with approximation error proportional to the variance of stochastic gradients and the fraction of Byzantine agents.
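
One plausible reading of a comparative-elimination style filter is sketched below: the coordinator ranks the agents' reported estimates by distance from its own current estimate, discards the $f$ farthest, and averages the rest. The abstract does not spell out the rule, so this reading, the function name `comparative_elimination`, and the example data are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def comparative_elimination(server_x, agent_xs, f):
    """Drop the f reports farthest from the server's estimate, average the rest.

    This is an assumed form of the filter, written for illustration only.
    """
    agent_xs = np.asarray(agent_xs, dtype=float)
    dists = np.linalg.norm(agent_xs - server_x, axis=1)
    keep = np.argsort(dists)[: len(agent_xs) - f]  # indices of the N - f closest reports
    return agent_xs[keep].mean(axis=0)

server_x = np.zeros(2)
reports = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [100.0, 100.0]]  # last one is an outlier
new_x = comparative_elimination(server_x, reports, f=1)
print(new_x)  # average of the three reports closest to the server
```

A filter of this kind costs one sort per round; whether it yields exact fault-tolerance depends on the $2f$-redundancy condition discussed in the abstract, which this sketch does not check.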

OC Apr 4, 2021
Finite-Time Convergence Rates of Nonlinear Two-Time-Scale Stochastic Approximation under Markovian Noise

Thinh T. Doan

We study the so-called two-time-scale stochastic approximation, a simulation-based approach for finding the roots of two coupled nonlinear operators. Our focus is to characterize its finite-time performance in a Markov setting, which often arises in stochastic control and reinforcement learning problems. In particular, we consider the scenario where the data in the method are generated by Markov processes and are therefore dependent. Such dependent data result in biased observations of the underlying operators. Under some fairly standard assumptions on the operators and the Markov processes, we provide a formula that characterizes the convergence rate of the mean square errors generated by the method to zero. Our result shows that the method achieves a convergence in expectation at a rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. Our analysis is mainly motivated by the classic singular perturbation theory for studying the asymptotic convergence of two-time-scale systems, that is, we consider a Lyapunov function that carefully characterizes the coupling between the two iterates. In addition, we utilize the geometric mixing time of the underlying Markov process to handle the bias and dependence in the data. Our theoretical result complements the existing literature, where the rate of nonlinear two-time-scale stochastic approximation under Markovian noise is unknown.

LG Jan 26, 2021
Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm

Sajad Khodadadian, Thinh T. Doan, Justin Romberg et al.

Actor-critic style two-time-scale algorithms are among the most popular methods in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the \emph{global} convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory of samples. Our analysis applies to very general settings, as we only assume ergodicity of the underlying Markov decision process. In order to ensure enough exploration, we employ an $ε$-greedy sampling of the trajectory. For a fixed and small enough exploration parameter $ε$, we show that the two-time-scale natural actor-critic algorithm has a rate of convergence of $\tilde{\mathcal{O}}(1/T^{1/4})$, where $T$ is the number of samples, and this leads to a sample complexity of $\tilde{\mathcal{O}}(1/δ^{8})$ samples to find a policy that is within an error of $δ$ from the \emph{global optimum}. Moreover, by carefully decreasing the exploration parameter $ε$ as the iterations proceed, we present an improved sample complexity of $\tilde{\mathcal{O}}(1/δ^{6})$ for convergence to the global optimum.

OC Nov 3, 2020
Nonlinear Two-Time-Scale Stochastic Approximation: Convergence and Finite-Time Performance

Thinh T. Doan

Two-time-scale stochastic approximation, a generalized version of the popular stochastic approximation, has found broad applications in many areas including stochastic control, optimization, and machine learning. Despite its popularity, theoretical guarantees of this method, especially its finite-time performance, are mostly achieved for the linear case while the results for the nonlinear counterpart are very sparse. Motivated by the classic control theory for singularly perturbed systems, we study in this paper the asymptotic convergence and finite-time analysis of the nonlinear two-time-scale stochastic approximation. Under some fairly standard assumptions, we provide a formula that characterizes the rate of convergence of the main iterates to the desired solutions. In particular, we show that the method achieves a convergence in expectation at a rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. The key idea in our analysis is to properly choose the two step sizes to characterize the coupling between the fast and slow-time-scale iterates.
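
A minimal sketch of a two-time-scale stochastic approximation loop is given below, assuming a toy pair of coupled linear operators with root $(x^\star, y^\star) = (2, 1)$. The operators, the noise level `sigma`, and the step-size schedules $1/(k+1)$ and $1/(k+1)^{2/3}$ are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def two_time_scale_sa(steps=20000, sigma=0.1, seed=0):
    """Find the root of F(x, y) = x + y - 3 and G(x, y) = y - 0.5*x
    from noisy samples (toy operators assumed for this example).
    x is the slow iterate, y the fast one.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for k in range(steps):
        alpha = 1.0 / (k + 1)              # slow step size
        beta = 1.0 / (k + 1) ** (2 / 3)    # fast step size, decays more slowly
        F = x + y - 3.0 + sigma * rng.normal()   # noisy sample of F
        G = y - 0.5 * x + sigma * rng.normal()   # noisy sample of G
        x, y = x - alpha * F, y - beta * G       # simultaneous update
    return x, y

x_hat, y_hat = two_time_scale_sa()
print(x_hat, y_hat)  # should land near (2, 1)
```

Because `beta` dominates `alpha` for large `k`, the fast iterate `y` tracks the solution of `G(x, y) = 0` for the current `x`, while `x` drifts slowly toward the root of the coupled system.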

LG Oct 28, 2020
Finite-Time Convergence Rates of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning

Sihan Zeng, Thinh T. Doan, Justin Romberg

We study a decentralized variant of stochastic approximation, a data-driven approach for finding the root of an operator under noisy measurements. A network of agents, each with its own operator and data observations, cooperatively find the fixed point of the aggregate operator over a decentralized communication graph. Our main contribution is to provide a finite-time analysis of this decentralized stochastic approximation method when the data observed at each agent are sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under fairly standard assumptions, we show that the convergence rate of the proposed method is essentially the same as if the samples were independent, differing only by a log factor that accounts for the mixing time of the Markov processes. The key idea in our analysis is to introduce a novel Razumikhin-Lyapunov function, motivated by the one used in analyzing the stability of delayed ordinary differential equations. We also discuss applications of the proposed method on a number of interesting learning problems in multi-agent systems.

LG Jun 24, 2020
Local Stochastic Approximation: A Unified View of Federated Learning and Distributed Multi-Task Reinforcement Learning Algorithms

Thinh T. Doan

Motivated by broad applications in reinforcement learning and federated learning, we study local stochastic approximation over a network of agents, where their goal is to find the root of an operator composed of the local operators at the agents. Our focus is to characterize the finite-time performance of this method when the data at each agent are generated from Markov processes, and hence they are dependent. In particular, we provide the convergence rates of local stochastic approximation for both constant and time-varying step sizes. Our results show that these rates are within a logarithmic factor of the ones under independent data. We then illustrate the applications of these results to different interesting problems in multi-task reinforcement learning and federated learning.

OC Mar 24, 2020
Finite-Time Analysis of Stochastic Gradient Descent under Markov Randomness

Thinh T. Doan, Lam M. Nguyen, Nhan H. Pham et al.

Motivated by broad applications in reinforcement learning and machine learning, this paper considers the popular stochastic gradient descent (SGD) when the gradients of the underlying objective function are sampled from Markov processes. This Markov sampling leads to the gradient samples being biased and not independent. The existing results for the convergence of SGD under Markov randomness are often established under the assumptions on the boundedness of either the iterates or the gradient samples. Our main focus is to study the finite-time convergence of SGD for different types of objective functions, without requiring these assumptions. We show that SGD converges nearly at the same rate with Markovian gradient samples as with independent gradient samples. The only difference is a logarithmic factor that accounts for the mixing time of the Markov chain.
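
The setting in this abstract can be illustrated with a small sketch in which the gradient samples come from a two-state Markov chain rather than from independent draws. The chain, its transition matrix, the objective $f(x) = \mathbb{E}_\pi[\tfrac{1}{2}(x - \xi)^2]$, and the $1/(k+1)$ step size are assumptions made for this example.

```python
import numpy as np

def markov_sgd(steps=50000, seed=0):
    """SGD on f(x) = E_pi[0.5 * (x - xi)^2], where xi follows a two-state
    Markov chain (toy problem assumed for illustration).
    The stationary distribution is (2/3, 1/3), so the minimizer is 1/3.
    """
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition matrix
    values = np.array([0.0, 1.0])           # xi takes value 0 or 1
    state, x = 0, 0.0
    for k in range(steps):
        grad = x - values[state]            # gradient of 0.5 * (x - xi)^2
        x -= grad / (k + 1)                 # diminishing step size
        # Next sample depends on the current one, so gradients are correlated.
        state = 0 if rng.random() < P[state, 0] else 1
    return x

x_hat = markov_sgd()
print(x_hat)  # should be close to 1/3
```

Consecutive gradient samples here are biased and dependent, yet the iterate still settles near the minimizer under the stationary distribution, which is the qualitative behavior the abstract describes.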

LG Dec 23, 2019
Finite-Time Analysis and Restarting Scheme for Linear Two-Time-Scale Stochastic Approximation

Thinh T. Doan

Motivated by their broad applications in reinforcement learning, we study the linear two-time-scale stochastic approximation, an iterative method using two different step sizes for finding the solutions of a system of two equations. Our main focus is to characterize the finite-time complexity of this method under time-varying step sizes and Markovian noise. In particular, we show that the mean square errors of the variables generated by the method converge to zero at a sublinear rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. We then improve the performance of this method by considering the restarting scheme, where we restart the algorithm after every predetermined number of iterations. We show that using this restarting method the complexity of the algorithm under time-varying step sizes is as good as the one using constant step sizes, while still achieving exact convergence to the desired solution. Moreover, the restarting scheme also helps to prevent the step sizes from getting too small, which is useful for the practical implementation of the linear two-time-scale stochastic approximation.

RO Sep 12, 2019
A Reinforcement Learning Framework for Sequencing Multi-Robot Behaviors

Pietro Pierpaoli, Thinh T. Doan, Justin Romberg et al.

Given a list of behaviors and associated parameterized controllers for solving different individual tasks, we study the problem of selecting an optimal sequence of coordinated behaviors in multi-robot systems for completing a given mission, which could not be handled by any single behavior. In addition, we are interested in the case where partial information of the underlying mission is unknown, so the robots must cooperatively learn this information through their course of actions. Such a problem can be formulated as an optimal decision problem in multi-robot systems; however, it is in general intractable due to modeling imperfections and the curse of dimensionality of the decision variables. To circumvent these issues, we first consider an alternate formulation of the original problem through introducing a sequence of behaviors' switching times. Our main contribution is then to propose a novel reinforcement-learning-based method that combines Q-learning and online gradient descent for solving this reformulated problem. In particular, the optimal sequence of the robots' behaviors is found by using Q-learning while the optimal parameters of the associated controllers are obtained through an online gradient descent method. Finally, to illustrate the effectiveness of our proposed method, we implement it on a team of differential-drive robots for solving two different missions, namely, convoy protection and object manipulation.

OCJul 25, 2019
Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation

Thinh T. Doan, Siva Theja Maguluri, Justin Romberg

We study the policy evaluation problem in multi-agent reinforcement learning, modeled by a Markov decision process. In this problem, the agents operate in a common environment under a fixed control policy, working together to discover the value (global discounted accumulative reward) associated with each environmental state. Over a series of time steps, the agents act, get rewarded, update their local estimates of the value function, and then communicate with their neighbors. The local update at each agent can be interpreted as a distributed variant of the popular temporal difference learning method TD($λ$). Our main contribution is to provide a finite-time analysis of the performance of this distributed TD($λ$) algorithm for both constant and time-varying step sizes. The key idea in our analysis is to use the geometric mixing time $τ$ of the underlying Markov chain: although the "noise" in our algorithm is Markovian, the dependence between samples spaced $τ$ steps apart is very weak. We provide an explicit upper bound on the convergence rate of the proposed method as a function of the network topology, the discount factor, the constant $λ$, and the mixing time $τ$. Our results also provide a mathematical explanation for observations that have appeared previously in the literature about the choice of $λ$. Our upper bound illustrates the trade-off between approximation accuracy and convergence speed implicit in the choice of $λ$. When $λ=1$, the solution corresponds to the best possible approximation of the value function, while choosing $λ=0$ leads to faster convergence when the noise in the algorithm has large variance.
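
The act, get rewarded, update, communicate loop described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm as specified: the two-state chain, the three agents and their local rewards, the doubly stochastic weight matrix `W`, the one-hot (tabular) features, the constant step size, and the exact ordering of the averaging and TD steps are all choices made for the example, as is the function name `distributed_td_lambda`.

```python
import random


def distributed_td_lambda(num_steps=20000, alpha=0.01, gamma=0.5, lam=0.5, seed=0):
    """Consensus-plus-TD(lambda) sketch on a toy 2-state Markov chain.

    Each agent sees only its own reward but shares the state trajectory;
    the goal is the value function of the average reward. All constants
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    P = [[0.5, 0.5], [0.5, 0.5]]                  # transition probabilities
    local_rewards = [[3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # mean reward [1, 0]
    W = [[0.5, 0.25, 0.25],                       # doubly stochastic weights
         [0.25, 0.5, 0.25],
         [0.25, 0.25, 0.5]]
    n_agents, n_states = 3, 2
    theta = [[0.0] * n_states for _ in range(n_agents)]
    z = [0.0] * n_states                          # shared eligibility trace
    s = 0
    for _ in range(num_steps):
        s_next = 0 if rng.random() < P[s][0] else 1
        # Decay the trace and add the one-hot feature of the current state.
        z = [gamma * lam * zv for zv in z]
        z[s] += 1.0
        # Communication step: weighted average of neighbors' estimates.
        mixed = [[sum(W[i][j] * theta[j][m] for j in range(n_agents))
                  for m in range(n_states)] for i in range(n_agents)]
        # Local TD(lambda) step using each agent's own reward and estimate.
        for i in range(n_agents):
            delta = local_rewards[i][s] + gamma * theta[i][s_next] - theta[i][s]
            theta[i] = [mixed[i][m] + alpha * delta * z[m] for m in range(n_states)]
        s = s_next
    return theta


if __name__ == "__main__":
    for row in distributed_td_lambda():
        print(row)
```

For this toy chain the value function of the average reward works out to $V = (1.5, 0.5)$, and with tabular features the agents' averaged estimate should settle in a small neighborhood of it under the constant step size.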

OCMay 27, 2019
Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning

Zaiwei Chen, Sheng Zhang, Thinh T. Doan et al.

Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., $α_k\equiv α$), the algorithm achieves exponentially fast convergence to a neighborhood (with radius $O(α\log(1/α))$) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive finite-sample bounds for the popular $Q$-learning algorithm with linear function approximation, under a condition on the behavior policy. Importantly, we do not need to assume that the samples are i.i.d., and we do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.
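
To make the setting of this abstract concrete, here is a minimal sketch of $Q$-learning with linear function approximation, run with a constant stepsize on a single Markovian trajectory and without any projection step. It is an illustration under assumptions, not the paper's experiment: the two-state, two-action deterministic MDP, the uniform behavior policy, the helper names `features`, `q_value`, and `q_learning_linear`, and the one-hot features (which make the linear model coincide with the tabular case) are all hypothetical choices.

```python
import random


def features(s, a):
    """One-hot feature vector for (state, action); an assumed choice."""
    phi = [0.0] * 4
    phi[2 * s + a] = 1.0
    return phi


def q_value(theta, s, a):
    """Linear Q estimate: inner product of weights and features."""
    return sum(t * p for t, p in zip(theta, features(s, a)))


def q_learning_linear(num_steps=5000, alpha=0.1, gamma=0.5, seed=0):
    """Constant-stepsize Q-learning along one trajectory of a toy MDP.

    Taking action a moves the system to state a; the reward is 1 only
    for action 1 taken in state 1. Actions come from a uniform behavior policy.
    """
    rng = random.Random(seed)
    theta = [0.0] * 4
    s = 0
    for _ in range(num_steps):
        a = rng.randrange(2)                      # uniform behavior policy
        s_next = a                                # deterministic transition
        r = 1.0 if (s == 1 and a == 1) else 0.0
        target = r + gamma * max(q_value(theta, s_next, b) for b in range(2))
        td_error = target - q_value(theta, s, a)
        phi = features(s, a)
        # Semi-gradient update; no projection is applied to theta.
        theta = [t + alpha * td_error * p for t, p in zip(theta, phi)]
        s = s_next                                # samples are not i.i.d.
    return theta


if __name__ == "__main__":
    print(q_learning_linear())
```

For this toy MDP the optimal action values are $Q^*(0,0)=0.5$, $Q^*(0,1)=1$, $Q^*(1,0)=0.5$, and $Q^*(1,1)=2$, so the learned weights should approach $(0.5, 1, 0.5, 2)$ in that order.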