SY Feb 20, 2015
Stability of Epidemic Models over Directed Graphs: A Positive Systems Approach
Ali Khanafer, Tamer Başar, Bahman Gharesifard
We study the stability properties of a susceptible-infected-susceptible (SIS) diffusion model, the so-called $n$-intertwined Markov model, over arbitrary directed network topologies. As in the majority of the work on infection spread dynamics, this model exhibits a threshold phenomenon. When the curing rates in the network are high, the disease-free state is the unique equilibrium over the network. Otherwise, an endemic equilibrium state emerges, where some infection remains within the network. Using notions from positive systems theory, we provide novel proofs for the global asymptotic stability of the equilibrium points in both cases over strongly connected networks based on the value of the basic reproduction number, a fundamental quantity in the study of epidemics. When the network topology is weakly connected, we provide conditions for the existence, uniqueness, and global asymptotic stability of an endemic state, and we study the stability of the disease-free state. Finally, we demonstrate that the $n$-intertwined Markov model can be viewed as a best-response dynamical system of a concave game among the nodes. This characterization allows us to cast new infection spread dynamics; additionally, we provide a sufficient condition for the global convergence to the disease-free state, which can be checked in a distributed fashion. Several simulations demonstrate our results.
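The threshold behavior described in this abstract can be illustrated with a small simulation. The sketch below assumes the commonly used form of the $n$-intertwined dynamics, dp_i/dt = (1 - p_i) * beta * sum_j a_ij * p_j - delta_i * p_i, with a uniform infection rate; the function name, the directed 3-cycle, and all numerical values are hypothetical choices for illustration and are not taken from the paper.

```python
def simulate_sis(adj, beta, delta, p0, dt=0.01, steps=20000):
    """Forward-Euler integration of the assumed SIS dynamics.

    dp_i/dt = (1 - p_i) * beta * sum_j adj[i][j] * p_j - delta[i] * p_i,
    where adj[i][j] > 0 means node j can infect node i.
    """
    n = len(adj)
    p = list(p0)
    for _ in range(steps):
        nxt = []
        for i in range(n):
            pressure = beta * sum(adj[i][j] * p[j] for j in range(n))
            dp = (1.0 - p[i]) * pressure - delta[i] * p[i]
            # Clip to [0, 1] so Euler round-off cannot leave the probability range.
            nxt.append(min(1.0, max(0.0, p[i] + dt * dp)))
        p = nxt
    return p


# Hypothetical strongly connected digraph: a directed 3-cycle.
ADJ = [[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]]
P0 = [0.5, 0.1, 0.9]
DELTA = [1.0, 1.0, 1.0]

# beta / delta = 0.5 < 1: the infection is expected to die out.
high_cure = simulate_sis(ADJ, beta=0.5, delta=DELTA, p0=P0)
# beta / delta = 2 > 1: an endemic state is expected (here p_i = 0.5 by symmetry).
low_cure = simulate_sis(ADJ, beta=2.0, delta=DELTA, p0=P0)
print(high_cure, low_cure)
```

For this symmetric example the adjacency matrix has spectral radius 1, so the ratio beta / delta plays the role of the basic reproduction number. In the second run the endemic level follows from solving (1 - p) * 2 * p = p, which gives p = 0.5.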
SY Jan 14, 2017
Markov-Nash Equilibria in Mean-Field Games with Discounted Cost
Naci Saldi, Tamer Başar, Maxim Raginsky
In this paper, we consider discrete-time dynamic games of the mean-field type with a finite number $N$ of agents subject to an infinite-horizon discounted-cost optimality criterion. The state space of each agent is a locally compact Polish space. At each time, the agents are coupled through the empirical distribution of their states, which affects both the agents' individual costs and their state transition probabilities. We introduce a new solution concept of the Markov-Nash equilibrium, under which a policy is player-by-player optimal in the class of all Markov policies. Under mild assumptions, we demonstrate the existence of a mean-field equilibrium in the infinite-population limit $N \to \infty$, and then show that the policy obtained from the mean-field equilibrium is approximately Markov-Nash when the number of agents $N$ is sufficiently large.
LG Nov 15, 2022
An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods
Yanli Liu, Kaiqing Zhang, Tamer Başar et al.
In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance reduction into the NPG update. Our improvements follow from an observation that the convergence analyses of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied to NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods. Our analysis carefully integrates the advantages of these two lines of work. Thanks to this improvement, we have also made variance reduction for NPG possible, with both global convergence and an efficient finite-sample complexity.
SY Dec 10, 2018
Distributed Discrete-time Optimization in Multi-agent Networks Using only Sign of Relative State
Jiaqi Zhang, Keyou You, Tamer Başar
This paper proposes distributed discrete-time algorithms to cooperatively solve an additive cost optimization problem in multi-agent networks. The striking feature lies in the use of only the sign of relative state information between neighbors, which substantially differentiates our algorithms from others in the existing literature. We first interpret the proposed algorithms in terms of the penalty method in optimization theory and then perform non-asymptotic analysis to study convergence for static network graphs. Compared with the celebrated distributed subgradient algorithms, which however use the exact relative state information, the convergence speed is essentially not affected by the loss of information. We also study how introducing noise into the relative state information and randomly activated graphs affect the performance of our algorithms. Finally, we validate the theoretical results on a class of distributed quantile regression problems.
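One way to read the penalty-method interpretation is as a subgradient method on the penalized objective sum_i f_i(x_i) + lam * sum over edges |x_i - x_j|, whose subgradient only needs the sign of each relative state. The sketch below is written under that assumption and is not the authors' exact algorithm; the quadratic local costs, the path graph, the step size 1/(k+1), and the penalty weight `lam` are hypothetical choices.

```python
def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)


def sign_based_consensus_opt(targets, neighbors, lam, iters=20000):
    """Minimize sum_i 0.5 * (x - targets[i])**2 using only signs of relative states.

    Each node i updates with its own gradient and the signs of x_j - x_i
    for its neighbors j; it never sees the exact value of x_j - x_i.
    """
    n = len(targets)
    x = [0.0] * n
    for k in range(iters):
        rho = 1.0 / (k + 1)  # diminishing step size (an assumption)
        nxt = []
        for i in range(n):
            grad = x[i] - targets[i]  # local gradient of f_i
            pull = sum(sign(x[j] - x[i]) for j in neighbors[i])
            nxt.append(x[i] + rho * (lam * pull - grad))
        x = nxt
    return x


# Hypothetical 3-node undirected path graph 0 - 1 - 2.
TARGETS = [1.0, 2.0, 6.0]
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}

x_final = sign_based_consensus_opt(TARGETS, NEIGHBORS, lam=5.0)
print(x_final)  # all entries should be close to the average of TARGETS
```

In this example the penalty weight must be large enough for the penalized minimizer to be an exact consensus point; `lam=5.0` exceeds the largest gradient imbalance (3) at the optimum x = 3. A smaller weight would leave the nodes in persistent disagreement.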
SY Apr 3, 2018
Dynamic Power Distribution System Management With a Locally Connected Communication Network
Kaiqing Zhang, Wei Shi, Hao Zhu et al.
Coordinated optimization and control of distribution-level assets can enable a reliable and optimal integration of massive amounts of distributed energy resources (DERs) and facilitate distribution system management (DSM). Accordingly, the objective is to coordinate the power injection at the DERs to keep certain quantities across the network, e.g., voltage magnitude, line flows, or line losses, close to a desired profile. By and large, the performance of the DSM algorithms has been challenged by two factors: i) the possibly non-strongly-connected communication network over DERs that hinders the coordination; ii) the dynamics of the real system caused by the DERs with heterogeneous capabilities, time-varying operating conditions, and real-time measurement mismatches. In this paper, we investigate the modeling and algorithm design and analysis with the consideration of these two factors. In particular, a game-theoretic characterization is first proposed to account for a locally connected communication network over DERs, along with the analysis of the existence and uniqueness of the Nash equilibrium (NE) therein. To achieve the equilibrium in a distributed fashion, a projected-gradient-based asynchronous DSM algorithm is then advocated. The algorithm performance, including the convergence speed and the tracking error, is analytically guaranteed under the dynamic setting. Extensive numerical tests on both synthetic and realistic cases corroborate the analytical results derived.
LG Aug 24, 2022
Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path
Muhammad Aneeq uz Zaman, Alec Koppel, Sujay Bhatt et al.
We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent. We call this Sandbox Learning, as it can be used as a warm-start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on a slower time-scale, in tandem with a control policy update on a faster time-scale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite sample convergence guarantees in terms of convergence of the mean-field and control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $\tilde{\mathcal{O}}(ε^{-4})$ where $ε$ is the MFE approximation error. This is similar to works which assume access to an oracle. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
OC Sep 9, 2023
Global Convergence of Receding-Horizon Policy Search in Learning Estimator Designs
Xiangyuan Zhang, Saviz Mowlavi, Mouhacine Benosman et al.
We introduce the receding-horizon policy gradient (RHPG) algorithm, the first PG algorithm with provable global convergence in learning the optimal linear estimator designs, i.e., the Kalman filter (KF). Notably, the RHPG algorithm does not require any prior knowledge of the system for initialization and does not require the target system to be open-loop stable. The key to RHPG is that we integrate vanilla PG (or any other policy search directions) into a dynamic programming outer loop, which iteratively decomposes the infinite-horizon KF problem that is constrained and non-convex in the policy parameter into a sequence of static estimation problems that are unconstrained and strongly convex, thus enabling global convergence. We further provide fine-grained analyses of the optimization landscape under RHPG and detail the convergence and sample complexity guarantees of the algorithm. This work serves as an initial attempt to develop reinforcement learning algorithms specifically for control applications with performance guarantees by utilizing classic control theory in both algorithmic design and theoretical analyses. Lastly, we validate our theories by deploying the RHPG algorithm to learn the Kalman filter design of a large-scale convection-diffusion model. We open-source the code repository at https://github.com/xiangyuan-zhang/LearningKF.
OC Feb 25, 2023
Revisiting LQR Control from the Perspective of Receding-Horizon Policy Gradient
Xiangyuan Zhang, Tamer Başar
We revisit in this paper the discrete-time linear quadratic regulator (LQR) problem from the perspective of receding-horizon policy gradient (RHPG), a newly developed model-free learning framework for control applications. We provide a fine-grained sample complexity analysis for RHPG to learn a control policy that is both stabilizing and $ε$-close to the optimal LQR solution, and our algorithm does not require knowing a stabilizing control policy for initialization. Combined with the recent application of RHPG in learning the Kalman filter, we demonstrate the general applicability of RHPG in linear control and estimation with streamlined analyses.
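The idea of running a policy search inside a dynamic programming outer loop can be sketched on a scalar LQR problem. The code below is a simplified illustration and not the authors' algorithm: it assumes a known model x' = a*x + b*u with stage cost q*x^2 + r*u^2 and uses exact gradients, whereas RHPG is described as model-free. All names and numbers are hypothetical.

```python
import math


def receding_horizon_gain(a, b, q, r, horizon=50, inner_steps=200, step=0.1):
    """Backward-in-time policy search for the scalar LQR gain (u = -k * x).

    Each outer iteration solves a one-step problem that is a strongly convex
    quadratic in k, so plain gradient descent on k converges from any start.
    """
    p = q    # terminal cost-to-go weight
    k = 0.0  # no stabilizing initial gain is assumed
    for _ in range(horizon):
        for _ in range(inner_steps):
            # Gradient of J(k) = q + r*k^2 + p*(a - b*k)^2 with respect to k.
            grad = 2.0 * r * k - 2.0 * b * p * (a - b * k)
            k -= step * grad
        # Cost-to-go weight under the gain just found (one DP step backward).
        p = q + r * k * k + p * (a - b * k) ** 2
    return k, p


def riccati_gain(a, b, q, r):
    """Closed-form infinite-horizon gain for the scalar problem (for comparison)."""
    c = r - q * b * b - a * a * r
    p = (-c + math.sqrt(c * c + 4.0 * b * b * q * r)) / (2.0 * b * b)
    return a * b * p / (r + b * b * p)


# Hypothetical open-loop unstable system (|a| > 1).
A, B, Q, R = 1.2, 1.0, 1.0, 1.0
k_learned, _ = receding_horizon_gain(A, B, Q, R)
k_star = riccati_gain(A, B, Q, R)
print(k_learned, k_star)
```

The search starts from k = 0, which does not stabilize this system, yet the gain obtained after enough backward steps matches the Riccati solution and yields |a - b*k| < 1. The fixed inner step size assumes the one-step Hessian 2*(r + b^2*p) stays below 20, which holds for these values.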
GT Mar 2, 2018
Generalized Colonel Blotto Game
Aidin Ferdowsi, Anibal Sanjab, Walid Saad et al.
Competitive resource allocation between adversarial decision makers arises in a wide spectrum of real-world applications such as communication systems, cyber-physical systems security, and financial, political, and electoral competition. As such, developing analytical tools to model and analyze competitive resource allocation is crucial for devising optimal allocation strategies and anticipating the potential outcomes of the competition. To this end, the Colonel Blotto game is one of the most popular game-theoretic frameworks for modeling and analyzing such competitive resource allocation problems. However, in many real-world competitive situations, the Colonel Blotto game does not admit solutions in deterministic strategies and, hence, one must rely on analytically complex mixed strategies with their associated tractability, applicability, and practicality challenges. In this paper, a generalization of the Colonel Blotto game which enables the derivation of deterministic, practical, and implementable equilibrium strategies is proposed while accounting for the heterogeneity of the battlefields. In addition, the proposed generalized game enables accounting for the consumed resources in each battlefield, a feature that is not considered in the classical Blotto game. For the generalized game, the existence of a Nash equilibrium in pure strategies is shown. Then, closed-form analytical expressions of the equilibrium strategies are derived, and the outcome of the game is characterized based on the number of resources of each player as well as the valuation of each battlefield. The generated results provide invaluable insights on the outcome of the competition. For example, the results show that, when both players are fully rational, the more resourceful player can achieve a better total payoff at the Nash equilibrium, a result that is not mimicked in the classical Blotto game.
OC Mar 30, 2013
Robust Distributed Averaging on Networks with Adversarial Intervention
Ali Khanafer, Behrouz Touri, Tamer Başar
We study the interaction between a network designer and an adversary over a dynamical network. The network consists of nodes performing continuous-time distributed averaging. The goal of the network designer is to assist the nodes in reaching consensus by changing the weights of a limited number of links in the network. Meanwhile, an adversary strategically disconnects a set of links to prevent the nodes from converging. We formulate two problems to describe this competition, where the order in which the players act is reversed in the two problems. We utilize Pontryagin's Maximum Principle (MP) to tackle both problems and derive the optimal strategies. Although the canonical equations provided by the MP are intractable, we provide an alternative characterization for the optimal strategies that highlights a connection with potential theory. Finally, we provide a sufficient condition for the existence of a saddle-point equilibrium (SPE) for this zero-sum game.
GT Feb 6, 2011
Adaptive Resource Allocation in Jamming Teams Using Game Theory
Ali Khanafer, Sourabh Bhattacharya, Tamer Başar
In this work, we study the problem of power allocation and adaptive modulation in teams of decision makers. We consider the special case of two teams with each team consisting of two mobile agents. Agents belonging to the same team communicate over wireless ad hoc networks, and they try to split their available power between the tasks of communication and jamming the nodes of the other team. The agents have constraints on their total energy and instantaneous power usage. The cost function adopted is the difference between the rates of erroneously transmitted bits of each team. We model the adaptive modulation problem as a zero-sum matrix game, which in turn gives rise to a continuous-kernel game to handle power control. Based on the communications model, we present sufficient conditions on the physical parameters of the agents for the existence of a pure strategy saddle-point equilibrium (PSSPE).
SY Nov 30, 2023
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms
Xiangyuan Zhang, Weichao Mao, Saviz Mowlavi et al.
We introduce controlgym, a library of thirty-six industrial control settings, and ten infinite-dimensional partial differential equation (PDE)-based control problems. Integrated within the OpenAI Gym/Gymnasium (Gym) framework, controlgym allows direct applications of standard reinforcement learning (RL) algorithms like stable-baselines3. Our control environments complement those in Gym with continuous, unbounded action and observation spaces, motivated by real-world control applications. Moreover, the PDE control environments uniquely allow the users to extend the state dimensionality of the system to infinity while preserving the intrinsic dynamics. This feature is crucial for evaluating the scalability of RL algorithms for control. This project serves the learning for dynamics & control (L4DC) community, aiming to explore key questions: the convergence of RL algorithms in learning control policies; the stability and robustness issues of learning-based controllers; and the scalability of RL algorithms to high- and potentially infinite-dimensional systems. We open-source the controlgym project at https://github.com/xiangyuan-zhang/controlgym.
GT Jan 31, 2011
Power Allocation in Team Jamming Games in Wireless Ad Hoc Networks
Sourabh Bhattacharya, Ali Khanafer, Tamer Başar
In this work, we study the problem of power allocation in teams. Each team consists of two agents who try to split their available power between the tasks of communication and jamming the nodes of the other team. The agents have constraints on their total energy and instantaneous power usage. The cost function is the difference between the rates of erroneously transmitted bits of each team. We model the problem as a zero-sum differential game between the two teams and use Isaacs' approach to obtain the necessary conditions for the optimal trajectories. This leads to a continuous-kernel power allocation game among the players. Based on the communications model, we present sufficient conditions on the physical parameters of the agents for the existence of a pure strategy Nash equilibrium (PSNE). Finally, we present simulation results for the case when the agents are holonomic.
GT Sep 29, 2017
Strategic Communication Between Prospect Theoretic Agents over a Gaussian Test Channel
Venkata Sriram Siddhardh Nadendla, Emrah Akyol, Cedric Langbort et al.
In this paper, we model a Stackelberg game in a simple Gaussian test channel where a human transmitter (leader) communicates a source message to a human receiver (follower). We model human decision making using prospect theory models proposed for continuous decision spaces. Assuming that the value function is the squared distortion at both the transmitter and the receiver, we analyze the effects of the weight functions at both the transmitter and the receiver on optimal communication strategies, namely encoding at the transmitter and decoding at the receiver, in the Stackelberg sense. We show that the optimal strategies for the behavioral agents in the Stackelberg sense are identical to those designed for unbiased agents. At the same time, we also show that the prospect-theoretic distortions at the transmitter and the receiver are both larger than the expected distortion, thus making behavioral agents less contented than unbiased agents. Consequently, the presence of cognitive biases increases the need for transmission power in order to achieve a given distortion at both transmitter and receiver.
SY Feb 20, 2015
Robust Distributed Averaging: When are Potential-Theoretic Strategies Optimal?
Ali Khanafer, Tamer Başar
We study the interaction between a network designer and an adversary over a dynamical network. The network consists of nodes performing continuous-time distributed averaging. The adversary strategically disconnects a set of links to prevent the nodes from reaching consensus. Meanwhile, the network designer assists the nodes in reaching consensus by changing the weights of a limited number of links in the network. We formulate two Stackelberg games to describe this competition, where the order in which the players act is reversed in the two problems. Although the canonical equations provided by Pontryagin's maximum principle seem to be intractable, we provide an alternative characterization for the optimal strategies that makes a connection to potential theory. Finally, we provide a sufficient condition for the existence of a saddle-point equilibrium for the underlying zero-sum game.
SY Apr 3, 2018
Distributed Equilibrium-Learning for Power Network Voltage Control With a Locally Connected Communication Network
Kaiqing Zhang, Wei Shi, Hao Zhu et al.
In current power distribution systems, one of the most challenging operation tasks is to coordinate the network-wide distributed energy resources (DERs) to maintain the stability of voltage magnitude of the system. This voltage control task has been investigated actively under either distributed optimization-based or local feedback control-based characterizations. The former architecture requires a strongly connected communication network among all DERs for implementing the optimization algorithms, a scenario not yet realistic in most of the existing distribution systems with under-deployed communication infrastructure. The latter one, on the other hand, has been proven to suffer from loss of network-wide operational optimality. In this paper, we propose a game-theoretic characterization for semi-local voltage control with only a locally connected communication network. We analyze the existence and uniqueness of the generalized Nash equilibrium (GNE) for this characterization and develop a fully distributed equilibrium-learning algorithm that relies on only neighbor-to-neighbor information exchange. Provable convergence results are provided along with numerical tests which corroborate the robust convergence property of the proposed algorithm.
OC Apr 2, 2017
Cash-settled options for wholesale electricity markets
Khaled Alshehri, Subhonmesh Bose, Tamer Başar
Wholesale electricity market designs in practice do not provide the market participants with adequate mechanisms to hedge their financial risks. Demanders and suppliers will likely face even greater risks with the deepening penetration of variable renewable resources like wind and solar. This paper explores the design of a centralized cash-settled call option market to mitigate such risks. A cash-settled call option is a financial instrument that allows its holder the right to claim a monetary reward equal to the positive difference between the real-time price of an underlying commodity and a pre-negotiated strike price for an upfront fee. Through an example, we illustrate that a bilateral call option can reduce the payment volatility of market participants. Then, we design a centralized clearing mechanism for call options that generalizes the bilateral trade. We illustrate through an example how the centralized clearing mechanism generalizes the bilateral trade. Finally, the effect of risk preference of the market participants and some generalizations are discussed.
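The payoff definition in this abstract translates directly into code. The following sketch assumes a simplified setting that is not from the paper: a single buyer purchases one unit of energy at the real-time price and optionally holds one call option. The price scenarios, strike, and fee are hypothetical numbers chosen for illustration.

```python
from statistics import pvariance


def call_payoff(real_time_price, strike):
    """Cash-settled call: positive part of (real-time price - strike)."""
    return max(real_time_price - strike, 0.0)


def buyer_payment(real_time_price, strike=None, fee=0.0):
    """Net payment of a buyer of one unit, with or without a call option."""
    payment = real_time_price
    if strike is not None:
        # The holder pays the upfront fee and receives the option payoff.
        payment += fee - call_payoff(real_time_price, strike)
    return payment


# Hypothetical equally likely real-time price scenarios.
PRICES = [20.0, 30.0, 100.0]

unhedged = [buyer_payment(p) for p in PRICES]
hedged = [buyer_payment(p, strike=40.0, fee=10.0) for p in PRICES]

print(unhedged, pvariance(unhedged))
print(hedged, pvariance(hedged))
```

With the option, the buyer's payment is capped at strike plus fee, so the payments move from 20, 30, 100 to 30, 40, 50 and their spread shrinks. Whether the fee is fair to both sides is a separate question that this sketch does not address.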
OC Oct 10, 2022
Towards a Theoretical Foundation of Policy Optimization for Learning Control Policies
Bin Hu, Kaiqing Zhang, Na Li et al.
Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis, popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently-developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems such as the linear quadratic regulator (LQR), $\mathcal{H}_\infty$ control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control.
SY Jan 5, 2018
Secure Sensor Design Against Undetected Infiltration: Minimum Impact-Minimum Damage
Muhammed O. Sayin, Tamer Başar
We propose a new defense mechanism against undetected infiltration into controllers in cyber-physical systems. To this end, we cautiously design the outputs of the sensors that monitor the state of the system. Unlike defense mechanisms that seek to detect infiltration, the proposed approach seeks to minimize the damage of possible attacks before they have been detected. The controller of a cyber-physical system could have been infiltrated by an undetected attacker at any time during operation. Disregarding such a possibility and disclosing the system's state without caution benefits the attacker in his/her malicious objective. Therefore, secure sensor design can further improve the security of cyber-physical systems when incorporated along with other defense mechanisms. Specifically, we consider a controlled Gauss-Markov process, where the controller could have been infiltrated at any time within the system's operation. In the sense of game-theoretic hierarchical equilibrium, we provide a semi-definite programming based algorithm to compute the optimal linear secure sensor outputs and analyze the performance for various scenarios numerically.
SY Feb 4, 2020
Graph-Theoretic Framework for Unified Analysis of Observability and Data Injection Attacks in the Smart Grid
Anibal Sanjab, Walid Saad, Tamer Başar
In this paper, a novel graph-theoretic framework is proposed to generalize the analysis of a broad set of security attacks, including observability and data injection attacks, that target the state estimator of a smart grid. First, the notion of observability attacks is defined based on a proposed graph-theoretic construct. In this respect, a structured approach is proposed to characterize critical sets, whose removal renders the system unobservable. It is then shown that, for the system to be observable, these critical sets must be part of a maximum matching over a proposed bipartite graph. In addition, it is shown that stealthy data injection attacks (SDIAs) constitute a special case of these observability attacks. Then, various attack strategies and defense policies, for observability and data injection attacks, are shown to be amenable to analysis using the introduced graph-theoretic framework. The proposed framework is then shown to provide a unified basis for analysis of four key security problems (among others), pertaining to the characterization of: 1) the sparsest SDIA; 2) the sparsest SDIA including a certain measurement; 3) a set of measurements which must be defended to thwart all potential SDIAs; and 4) the set of measurements which, when protected, can thwart any SDIA whose cardinality is below a certain threshold. A case study using the IEEE 14-bus system with a set of 17 measurements is used to support the theoretical findings.
OC Jan 30, 2023
Learning the Kalman Filter with Fine-Grained Sample Complexity
Xiangyuan Zhang, Bin Hu, Tamer Başar
We develop the first end-to-end sample complexity of model-free policy gradient (PG) methods in discrete-time infinite-horizon Kalman filtering. Specifically, we introduce the receding-horizon policy gradient (RHPG-KF) framework and demonstrate $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity for RHPG-KF in learning a stabilizing filter that is $ε$-close to the optimal Kalman filter. Notably, the proposed RHPG-KF framework does not require the system to be open-loop stable nor assume any prior knowledge of a stabilizing filter. Our results shed light on applying model-free PG methods to control a linear dynamical system where the state measurements could be corrupted by statistical noises and other (possibly adversarial) disturbances.
OC Jun 6, 2022
Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
Dongsheng Ding, Kaiqing Zhang, Jiali Duan et al.
We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
GT Sep 17, 2012
Nash Equilibria for Stochastic Games with Asymmetric Information-Part 1: Finite Games
Ashutosh Nayyar, Abhishek Gupta, Cédric Langbort et al.
A model of stochastic games where multiple controllers jointly control the evolution of the state of a dynamic system but have access to different information about the state and action processes is considered. The asymmetry of information among the controllers makes it difficult to compute or characterize Nash equilibria. Using common information among the controllers, the game with asymmetric information is shown to be equivalent to another game with symmetric information. Further, under certain conditions, a Markov state is identified for the equivalent symmetric information game and its Markov perfect equilibria are characterized. This characterization provides a backward induction algorithm to find Nash equilibria of the original game with asymmetric information in pure or behavioral strategies. Each step of this algorithm involves finding Bayesian Nash equilibria of a one-stage Bayesian game. The Nash equilibria of the original game that can be characterized in this backward manner are named common information based Markov perfect equilibria.
DC Apr 12, 2024
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke et al.
Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.
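The head-of-line blocking effect, and the way a shortest-job-first order built on predicted lengths can relieve it, can be sketched in a few lines. This is a toy model and not the authors' SSJF implementation: it assumes all requests arrive at time zero, are served one at a time without batching, and that the proxy model's predictions are given as a hard-coded hypothetical list.

```python
def average_completion_time(lengths, order):
    """Average job completion time when jobs run back to back in `order`."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += lengths[job]  # the job finishes after its true length
        total += clock
    return total / len(order)


# Hypothetical true output lengths (in tokens) and imperfect proxy predictions.
TRUE_LEN = [100, 10, 20]
PREDICTED_LEN = [90, 15, 18]

# FCFS serves requests in arrival order.
fcfs_order = list(range(len(TRUE_LEN)))
# Speculative SJF sorts by the predicted length, not the unknown true length.
ssjf_order = sorted(fcfs_order, key=lambda job: PREDICTED_LEN[job])

fcfs_avg = average_completion_time(TRUE_LEN, fcfs_order)
ssjf_avg = average_completion_time(TRUE_LEN, ssjf_order)
print(fcfs_order, fcfs_avg)
print(ssjf_order, ssjf_avg)
```

The predictions only need to rank the jobs correctly for the order to match true shortest-job-first. In this example the long request no longer delays the two short ones, which halves the average completion time.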
GT Apr 18
A Stackelberg Game Framework with Drainability Guardrails for Pricing and Scaling in Multi-Tenant GPU Cloud Platforms
Junji Yan, Asrin Efe Yorulmaz, Hanchen Zhou et al.
Modern Graphics Processing Unit (GPU)-backed services must satisfy strict latency service-level objectives (SLOs) while controlling spare-capacity cost. In multi-tenant GPU cloud platforms, this trade-off is inherently dynamic because workload demand is endogenous; specifically, pricing shapes the submissions of heterogeneous tenants, which subsequently impact congestion and delay. We formulate the joint pricing-and-scaling problem as a large-population Stackelberg game problem, and we derive an explicit equilibrium demand map. The resulting closed-loop model reveals a structural failure mode in which delay-insensitive workloads sustain a residual demand floor, making the backlog undrainable under bounded price and service capacity. This observation motivates a computable drainability guardrail that certifies uniformly negative drift in the residual-demand regime. For any fixed price-capacity pair satisfying the drainability guardrail, we establish a unique operating point and global convergence towards it under a checkable step-size condition. Building on this fixed-pair analysis, we further develop an optimizer-agnostic action shield for the full dynamic problem and show empirically that it improves safety and robustness for model-free reinforcement learning (RL) in this setting.
3.8 SY May 18
Cooperative and Noncooperative Paradigms for Game-Theoretic Control of Socio-Technical Systems
Tamer Başar, Tomohisa Hayakawa, Hideaki Ishii et al.
This tutorial presents cooperative and noncooperative game-theoretic frameworks for modeling, learning, and control in socio-technical systems, where human behavior, incentives, institutions, and social interactions are coupled with cyber-physical and networked infrastructures. The paper reviews strategic, dynamic, cooperative, matching, learning, and feedback-control approaches for analyzing how local decision-making, adaptation, and strategic interactions shape collective system outcomes. The tutorial further develops feedback-learning and incentive-design perspectives that connect equilibrium analysis with adaptation, distributed control, and mechanism design under information and coordination constraints. We also examine resilience and security challenges arising from adversarial behavior, misinformation, disruptions, and cascading failures in interconnected socio-technical networks. Finally, we discuss emerging research directions at the intersection of game theory, control, learning, and network science for resilient and adaptive socio-technical systems.
SY Nov 30, 2025
The Silence that Speaks: Neural Estimation via Communication Gaps
Shubham Aggarwal, Dipankar Maity, Tamer Başar
Accurate remote state estimation is a fundamental component of many autonomous and networked dynamical systems, where multiple decision-making agents interact and communicate over shared, bandwidth-constrained channels. These communication constraints introduce an additional layer of complexity, namely, the decision of when to communicate. This results in a fundamental trade-off between estimation accuracy and communication resource usage. Traditional extensions of classical estimation algorithms (e.g., the Kalman filter) treat the absence of communication as 'missing' information. However, silence itself can carry implicit information about the system's state, which, if properly interpreted, can enhance the estimation quality even in the absence of explicit communication. Leveraging this implicit structure, however, poses significant analytical challenges, even in relatively simple systems. In this paper, we propose CALM (Communication-Aware Learning and Monitoring), a novel learning-based framework that jointly addresses the dual challenges of communication scheduling and estimator design. Our approach entails learning not only when to communicate but also how to infer useful information from periods of communication silence. We perform comparative case studies on multiple benchmarks to demonstrate that CALM is able to decode the implicit coordination between the estimator and the scheduler to extract information from the instances of 'silence' and enhance the estimation accuracy.
GT Nov 4, 2025
Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games
Asrin Efe Yorulmaz, Tamer Başar
No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium for concepts such as Coarse Correlated Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of the number of iterations $T$, while also improving the dependence on the action set size from polynomial to polylogarithmic, yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games, and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
OC Apr 12, 2025
InterQ: A DQN Framework for Optimal Intermittent Control
Shubham Aggarwal, Dipankar Maity, Tamer Başar
In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system's state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at https://github.com/AC-sh/InterQ.
SY Apr 3, 2024
Decision Transformer as a Foundation Model for Partially Observable Continuous Control
Xiangyuan Zhang, Weichao Mao, Haoran Qiu et al.
Closed-loop control of nonlinear dynamical systems with partial-state observability demands expert knowledge of a diverse, less standardized set of theoretical tools. Moreover, it requires a delicate integration of controller and estimator designs to achieve the desired system behavior. To establish a general controller synthesis framework, we explore the Decision Transformer (DT) architecture. Specifically, we first frame the control task as predicting the current optimal action based on past observations, actions, and rewards, eliminating the need for a separate estimator design. Then, we leverage the pre-trained language models, i.e., the Generative Pre-trained Transformer (GPT) series, to initialize DT and subsequently train it for control tasks using low-rank adaptation (LoRA). Our comprehensive experiments across five distinct control tasks, ranging from maneuvering aerospace systems to controlling partial differential equations (PDEs), demonstrate DT's capability to capture the parameter-agnostic structures intrinsic to control tasks. DT exhibits remarkable zero-shot generalization abilities for completely new tasks and rapidly surpasses expert performance levels with a minimal amount of demonstration data. These findings highlight the potential of DT as a foundational controller for general control applications.
LG Mar 17, 2024
Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective
Muhammad Aneeq uz Zaman, Alec Koppel, Mathieu Laurière et al.
In this paper, we address Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTG). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $O(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRNPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. The included numerical studies corroborate the theoretical results.
GT Mar 25, 2024
Policy Optimization finds Nash Equilibrium in Regularized General-Sum LQ Games
Muhammad Aneeq uz Zaman, Shubham Aggarwal, Melih Bastopcu et al.
In this paper, we investigate the impact of introducing relative entropy regularization on the Nash Equilibria (NE) of General-Sum $N$-agent games, revealing that the NE of such games conform to linear Gaussian policies. Moreover, we delineate sufficient conditions, contingent upon the adequacy of entropy regularization, for the uniqueness of the NE within the game. As Policy Optimization serves as a foundational approach for Reinforcement Learning (RL) techniques aimed at finding the NE, in this work we prove the linear convergence of a policy optimization algorithm which (subject to the adequacy of entropy regularization) is capable of provably attaining the NE. Furthermore, in scenarios where the entropy regularization proves insufficient, we present a $δ$-augmentation technique, which facilitates the achievement of an $ε$-NE within the game.
GT Mar 13, 2024
Learning How to Strategically Disclose Information
Raj Kiriti Velicheti, Melih Bastopcu, S. Rasoul Etesami et al.
Strategic information disclosure, in its simplest form, considers a game between an information provider (sender), who has access to some private information, and an information receiver who is interested in that information. While the receiver takes an action that affects the utilities of both players, the sender can design information (or modify beliefs) of the receiver through signal commitment, hence posing a Stackelberg game. However, obtaining a Stackelberg equilibrium for this game traditionally requires the sender to have access to the receiver's objective. In this work, we consider an online version of information design where a sender interacts with a receiver of an unknown type who is adversarially chosen at each round. Restricting attention to Gaussian prior and quadratic costs for the sender and the receiver, we show that $\mathcal{O}(\sqrt{T})$ regret is achievable with full information feedback, where $T$ is the total number of interactions between the sender and the receiver. Further, we propose a novel parametrization that allows the sender to achieve $\mathcal{O}(\sqrt{T})$ regret for a general convex utility function. We then consider the Bayesian Persuasion problem with an additional cost term in the objective function, which penalizes more informative signaling policies, and obtain $\mathcal{O}(\log(T))$ regret. Finally, we establish a sublinear regret bound for the partial information feedback setting and provide simulations to support our theoretical results.
SY Mar 1, 2024
Policy Optimization for PDE Control with a Warm Start
Xiangyuan Zhang, Saviz Mowlavi, Mouhacine Benosman et al.
Dimensionality reduction is crucial for controlling nonlinear partial differential equations (PDE) through a "reduce-then-design" strategy, which identifies a reduced-order model and then implements model-based control solutions. However, inaccuracies in the reduced-order modeling can substantially degrade controller performance, especially in PDEs with chaotic behavior. To address this issue, we augment the reduce-then-design procedure with a policy optimization (PO) step. The PO step fine-tunes the model-based controller to compensate for the modeling error from dimensionality reduction. This augmentation shifts the overall strategy into reduce-then-design-then-adapt, where the model-based controller serves as a warm start for PO. Specifically, we study the state-feedback tracking control of PDEs that aims to align the PDE state with a specific constant target subject to a linear-quadratic cost. Through extensive experiments, we show that a few iterations of PO can significantly improve the model-based controller performance. Our approach offers a cost-effective alternative to PDE control using end-to-end reinforcement learning.
LG Apr 17, 2024
Control Theoretic Approach to Fine-Tuning and Transfer Learning
Erkan Bayram, Shenyu Liu, Mohamed-Ali Belabbas et al.
Given a training set in the form of a paired set $(\mathcal{X},\mathcal{Y})$, we say that the control system $\dot x = f(x,u)$ has learned the paired set via the control $u^*$ if the system steers each point of $\mathcal{X}$ to its corresponding target in $\mathcal{Y}$. If the training set is expanded, most existing methods for finding a new control $u^*$ require starting from scratch, resulting in a quadratic increase in complexity with the number of points. To overcome this limitation, we introduce the concept of tuning without forgetting. We develop an iterative algorithm to tune the control $u^*$ when the training set expands, whereby points already in the paired set are still matched, and new training samples are learned. At each update of our method, the control $u^*$ is projected onto the kernel of the end-point mapping generated by the controlled dynamics at the learned samples. This keeps the end-points for the previously learned samples constant while iteratively learning additional samples.
LG Sep 22, 2025
Control Disturbance Rejection in Neural ODEs
Erkan Bayram, Mohamed-Ali Belabbas, Tamer Başar
In this paper, we propose an iterative training algorithm for Neural ODEs that provides models resilient to control (parameter) disturbances. The method builds on our earlier work, Tuning without Forgetting: it similarly introduces training points sequentially and updates the parameters on new data within the space of parameters that do not decrease performance on the previously learned training points. The key difference is that, inspired by the concept of flat minima, we solve a minimax problem for a non-convex non-concave functional over an infinite-dimensional control space. We develop a projected gradient descent algorithm on the space of parameters that admits the structure of an infinite-dimensional Banach subspace. We show through simulations that this formulation enables the model to effectively learn new data points and gain robustness against control disturbance.
LG Sep 3, 2025
Geometric Foundations of Tuning without Forgetting in Neural ODEs
Erkan Bayram, Mohamed-Ali Belabbas, Tamer Başar
In our earlier work, we introduced the principle of Tuning without Forgetting (TwF) for sequential training of neural ODEs, where training samples are added iteratively and parameters are updated within the subspace of control functions that preserves the end-point mapping at previously learned samples on the manifold of output labels in the first-order approximation sense. In this letter, we prove that this parameter subspace forms a Banach submanifold of finite codimension under nonsingular controls, and we characterize its tangent space. This reveals that TwF corresponds to a continuation/deformation of the control function along the tangent space of this Banach submanifold, providing a theoretical foundation for its exact mapping-preserving (non-forgetting) behavior during sequential training, beyond the first-order approximation.
LG Nov 7, 2024
Structure Matters: Dynamic Policy Gradient
Sara Klein, Xiangyuan Zhang, Tamer Başar et al.
In this work, we study $γ$-discounted infinite-horizon tabular Markov decision processes (MDPs) and introduce a framework called dynamic policy gradient (DynPG). The framework directly integrates dynamic programming with (any) policy gradient method, explicitly leveraging the Markovian property of the environment. DynPG dynamically adjusts the problem horizon during training, decomposing the original infinite-horizon MDP into a sequence of contextual bandit problems. By iteratively solving these contextual bandits, DynPG converges to the stationary optimal policy of the infinite-horizon MDP. To demonstrate the power of DynPG, we establish its non-asymptotic global convergence rate under the tabular softmax parametrization, focusing on the dependencies on salient but essential parameters of the MDP. By combining classical arguments from dynamic programming with more recent convergence arguments of policy gradient schemes, we prove that softmax DynPG scales polynomially in the effective horizon $(1-γ)^{-1}$. Our findings contrast recent exponential lower bound examples for vanilla policy gradient.
4.4 OC Mar 25
Variational Contraction Conditions for Iterative Algorithms in Multi-Population Discrete-Time Regularized Mean-Field Games
Uğur Aydın, Tamer Başar
In this work, we study the contraction conditions of iterative algorithms for stationary and finite-horizon discrete-time regularized mean-field games (MFGs) with multiple populations, where each population only interacts with the state distributions of the other populations. Due to the high dimensionality caused by the interaction of different populations, contraction rates for these algorithms cannot, in general, be expressed in terms of radicals. By studying the dynamics of these iterative algorithms and assuming that the system components of each population's MFG are Lipschitz continuous, we present explicit (eventual) contraction conditions for each algorithm in any normed space, relying only on these Lipschitz parameters. As a consequence of these contraction conditions, we provide convergence rates of finite-horizon mean-field equilibria to infinite-horizon stationary (and non-stationary) mean-field equilibria (MFEs), under restrictions on a variational characterization of the dynamics of these iterative algorithms. In the single-population case, the restrictions we impose on this variational characterization to obtain these convergence results are less restrictive than previous results in the literature.
GT Aug 29, 2025
A Soft Inducement Framework for Incentive-Aided Steering of No-Regret Players
Asrin Efe Yorulmaz, Raj Kiriti Velicheti, Melih Bastopcu et al.
In this work, we investigate a steering problem in a mediator-augmented two-player normal-form game, where the mediator aims to guide players toward a specific action profile through information and incentive design. We first characterize the games for which successful steering is possible. Moreover, we establish that steering players to any desired action profile is not always achievable with information design alone, nor when accompanied with sublinear payment schemes. Consequently, we derive a lower bound on the constant payments required per round to achieve this goal. To address these limitations incurred with information design, we introduce an augmented approach that involves a one-shot information design phase before the start of the repeated game, transforming the prior interaction into a Stackelberg game. Finally, we theoretically demonstrate that this approach improves the convergence rate of players' action profiles to the target point by a constant factor with high probability, and support it with empirical results.
GT Aug 26, 2025
Aggregate Fictitious Play for Learning in Anonymous Polymatrix Games (Extended Version)
Semih Kara, Tamer Başar
Fictitious play (FP) is a well-studied algorithm that enables agents to learn Nash equilibrium in games with certain reward structures. However, when agents have no prior knowledge of the reward functions, FP faces a major challenge: the joint action space grows exponentially with the number of agents, which slows down reward exploration. Anonymous games offer a structure that mitigates this issue. In these games, the rewards depend only on the actions taken; not on who is taking which action. Under such a structure, we introduce aggregate fictitious play (agg-FP), a variant of FP where each agent tracks the frequency of the number of other agents playing each action, rather than these agents' individual actions. We show that in anonymous polymatrix games, agg-FP converges to a Nash equilibrium under the same conditions as classical FP. In essence, by aggregating the agents' actions, we reduce the action space without losing the convergence guarantees. Using simulations, we provide empirical evidence on how this reduction accelerates convergence.
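The aggregation step described above can be sketched as follows. This is only an illustration of tracking count vectors instead of individual actions; it omits the best-response step of fictitious play, and the function names and the example history are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def aggregate(actions_of_others: List[int], num_actions: int) -> Tuple[int, ...]:
    """Map individual actions to a count vector: entry k is how many others played action k."""
    counts = Counter(actions_of_others)
    return tuple(counts.get(k, 0) for k in range(num_actions))


def empirical_frequency(
    history: List[List[int]], num_actions: int
) -> Dict[Tuple[int, ...], float]:
    """Empirical frequency of each observed count vector over the past rounds."""
    tally = Counter(aggregate(round_actions, num_actions) for round_actions in history)
    return {profile: n / len(history) for profile, n in tally.items()}


# Hypothetical history: three other agents, two actions, four rounds.
history = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 0]]
freq = empirical_frequency(history, num_actions=2)

# For comparison: the number of distinct joint action profiles in the same history.
distinct_joint_profiles = len({tuple(round_actions) for round_actions in history})
```

The four rounds contain four distinct joint action profiles but only two distinct count vectors, which is the kind of reduction the anonymity structure makes possible.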
GT Feb 2, 2024
$\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games
Weichao Mao, Haoran Qiu, Chen Wang et al.
No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more generic setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are also included to corroborate our theoretical findings.
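For readers unfamiliar with OFTRL, the following is a minimal sketch of its normal-form building block, assuming a negative-entropy regularizer and using the most recent loss vector as the prediction of the next one. It is not the Markov-game algorithm of the paper, which additionally relies on value update procedures; the function names, step size, and loss sequence are illustrative assumptions.

```python
import math
from typing import List


def oftrl_entropy(cumulative_loss: List[float], last_loss: List[float], eta: float) -> List[float]:
    """One OFTRL step with a negative-entropy regularizer over the probability simplex.

    The last observed loss is counted once more as the prediction of the next loss,
    so the iterate is proportional to exp(-eta * (cumulative_loss + last_loss)).
    """
    scores = [-eta * (c + l) for c, l in zip(cumulative_loss, last_loss)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]


def run(losses: List[List[float]], eta: float) -> List[float]:
    """Feed a fixed sequence of loss vectors and return the final mixed strategy."""
    n = len(losses[0])
    cumulative = [0.0] * n
    strategy = [1.0 / n] * n  # uniform strategy before any feedback
    for loss in losses:
        cumulative = [c + l for c, l in zip(cumulative, loss)]
        strategy = oftrl_entropy(cumulative, loss, eta)
    return strategy


# Hypothetical two-action example in which action 0 always incurs the smaller loss.
final = run([[0.0, 1.0]] * 5, eta=0.5)
```

With this regularizer the update has a closed-form softmax solution, which is why no explicit optimization routine appears in the sketch.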
GT Dec 15, 2021
Finite-Sample Analysis of Decentralized Q-Learning for Stochastic Games
Zuguang Gao, Qianqian Ma, Tamer Başar et al.
Learning in stochastic games is arguably the most standard and fundamental setting in multi-agent reinforcement learning (MARL). In this paper, we consider decentralized MARL in stochastic games in the non-asymptotic regime. In particular, we establish the finite-sample complexity of fully decentralized Q-learning algorithms in a significant class of general-sum stochastic games (SGs), namely weakly acyclic SGs, which include the common cooperative MARL setting with an identical reward to all agents (a Markov team problem) as a special case. We focus on the practical yet challenging setting of fully decentralized MARL, where neither the rewards nor the actions of other agents can be observed by each agent. In fact, each agent is completely oblivious to the presence of other decision makers. Both the tabular and the linear function approximation cases are considered. In the tabular setting, we analyze the sample complexity for the decentralized Q-learning algorithm to converge to a Markov perfect equilibrium (Nash equilibrium). With linear function approximation, the results are for convergence to a linear approximated equilibrium, a new notion of equilibrium that we propose, in which each agent's policy is a best reply (to other agents) within a linear space. Numerical experiments are also provided for both settings to demonstrate the results.
LG Oct 12, 2021
On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning
Weichao Mao, Lin F. Yang, Kaiqing Zhang et al.
Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential sample complexity dependence on the number of agents, a phenomenon known as the curse of multiagents. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in decentralized MARL, and aim to improve existing algorithms along this line. For learning (coarse) correlated equilibria in general-sum Markov games, we propose stage-based V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-weighted-regret bandit subroutine. For learning Nash equilibria in Markov potential games, we propose an independent policy gradient algorithm with a decentralized momentum-based variance reduction technique. All our algorithms are decentralized in that each agent can make decisions based on only its local information. Neither communication nor centralized coordination is required during learning, leading to a natural generalization to a large number of agents. We also provide numerical simulations to corroborate our theoretical findings.
LG Oct 12, 2021
Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games
Weichao Mao, Tamer Başar
This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $ε$-approximate CCE in at most $\widetilde{O}( H^6S A /ε^2)$ episodes, where $S$ is the number of states, $A$ is the size of the largest individual action space, and $H$ is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which may be of independent interest. One key feature of our algorithm is that it is fully decentralized, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
OC Jan 4, 2021
Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity
Kaiqing Zhang, Xiangyuan Zhang, Bin Hu et al.
Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we term the implicit regularization property, and which is an essential requirement in safety-critical control systems.
LG Oct 7, 2020
Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control
Weichao Mao, Kaiqing Zhang, Ruihao Zhu et al.
We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} Δ^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $Δ>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further present a parameter-free algorithm named Double-Restart Q-UCB that does not require prior knowledge of the variation budget. We show that our algorithms are nearly optimal by establishing an information-theoretical lower bound of $Ω(S^{\frac{1}{3}} A^{\frac{1}{3}} Δ^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL. Numerical experiments validate the advantages of RestartQ-UCB in terms of both cumulative rewards and computational efficiency. We demonstrate the power of our results in examples of multi-agent RL and inventory control across related products.
SY Sep 9, 2020
Reinforcement Learning in Non-Stationary Discrete-Time Linear-Quadratic Mean-Field Games
Muhammad Aneeq uz Zaman, Kaiqing Zhang, Erik Miehling et al.
In this paper, we study large population multi-agent reinforcement learning (RL) in the context of discrete-time linear-quadratic mean-field games (LQ-MFGs). Our setting differs from most existing work on RL for MFGs, in that we consider a non-stationary MFG over an infinite horizon. We propose an actor-critic algorithm to iteratively compute the mean-field equilibrium (MFE) of the LQ-MFG. There are two primary challenges: i) the non-stationarity of the MFG induces a linear-quadratic tracking problem, which requires solving a backwards-in-time (non-causal) equation that cannot be solved by standard (causal) RL algorithms; ii) many RL algorithms assume that the states are sampled from the stationary distribution of a Markov chain (MC), that is, the chain is already mixed, an assumption that is not satisfied for real data sources. We first identify that the mean-field trajectory follows linear dynamics, allowing the problem to be reformulated as a linear quadratic Gaussian problem. Under this reformulation, we propose an actor-critic algorithm that allows samples to be drawn from an unmixed MC. Finite-sample convergence guarantees for the algorithm are then provided. To characterize the performance of our algorithm in multi-agent RL, we develop an error bound with respect to the Nash equilibrium of the finite-population game.
LG Jul 15, 2020
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Kaiqing Zhang, Sham M. Kakade, Tamer Başar et al.
Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-γ)^{-3}ε^{-2})$ for finding the Nash equilibrium (NE) value up to some $ε$ error, and the $ε$-NE policies with a smooth planning oracle, where $γ$ is the discount factor, and $S,A,B$ denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde Ω(|S|(|A|+|B|)(1-γ)^{-3}ε^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), which arises particularly in the multi-agent context.
AI Jun 8, 2020
POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis
Weichao Mao, Kaiqing Zhang, Qiaomin Xie et al.
Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT, an algorithm that augments MCTS with a continuous armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for POLY-HOOT: the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings.