SYMay 28, 2012
Optimal Strategies for Communication and Remote Estimation with an Energy Harvesting SensorAshutosh Nayyar, Tamer Basar, Demosthenis Teneketzis et al.
We consider a remote estimation problem with an energy harvesting sensor and a remote estimator. The sensor observes the state of a discrete-time source which may be a finite state Markov chain or a multi-dimensional linear Gaussian system. It harvests energy from its environment (say, for example, through a solar cell) and uses this energy for the purpose of communicating with the estimator. Due to the randomness of energy available for communication, the sensor may not be able to communicate all the time. The sensor may also want to save its energy for future communications. The estimator relies on messages communicated by the sensor to produce real-time estimates of the source state. We consider the problem of finding a communication scheduling strategy for the sensor and an estimation strategy for the estimator that jointly minimize an expected sum of communication and distortion costs over a finite time horizon. Our goal of joint optimization leads to a decentralized decision-making problem. By viewing the problem from the estimator's perspective, we obtain a dynamic programming characterization for the decentralized decision-making problem that involves optimization over functions. Under some symmetry assumptions on the source statistics and the distortion metric, we show that an optimal communication strategy is described by easily computable thresholds and that the optimal estimate is a simple function of the most recently received sensor observation.
SYSep 8, 2012
Decentralized Stochastic Control with Partial History Sharing: A Common Information ApproachAshutosh Nayyar, Aditya Mahajan, Demosthenis Teneketzis
A general model of decentralized stochastic control called partial history sharing information structure is presented. In this model, at each step the controllers share part of their observation and control history with each other. This general model subsumes several existing models of information sharing as special cases. Based on the information commonly known to all the controllers, the decentralized problem is reformulated as an equivalent centralized problem from the perspective of a coordinator. The coordinator knows the common information and select prescriptions that map each controller's local information to its control actions. The optimal control problem at the coordinator is shown to be a partially observable Markov decision process (POMDP) which is solved using techniques from Markov decision theory. This approach provides (a) structural results for optimal strategies, and (b) a dynamic program for obtaining optimal strategies for all controllers in the original decentralized problem. Thus, this approach unifies the various ad-hoc approaches taken in the literature. In addition, the structural results on optimal control strategies obtained by the proposed approach cannot be obtained by the existing generic approach (the person-by-person approach) for obtaining structural results in decentralized problems; and the dynamic program obtained by the proposed approach is simpler than that obtained by the existing generic approach (the designer's approach) for obtaining dynamic programs in decentralized problems.
SYSep 6, 2013
Structural Results and Explicit Solution for Two-Player LQG Systems on a Finite Time HorizonLaurent Lessard, Ashutosh Nayyar
It is well-known that linear dynamical systems with Gaussian noise and quadratic cost (LQG) satisfy a separation principle. Finding the optimal controller amounts to solving separate dual problems; one for control and one for estimation. For the discrete-time finite-horizon case, each problem is a simple forward or backward recursion. In this paper, we consider a generalization of the LQG problem in which there are two controllers. Each controller is responsible for one of two system inputs, but has access to different subsets of the available measurements. Our paper has three main contributions. First, we prove a fundamental structural result: sufficient statistics for the controllers can be expressed as conditional means of the global state. Second, we give explicit state-space formulae for the optimal controller. These formulae are reminiscent of the classical LQG solution with dual forward and backward recursions, but with the important difference that they are intricately coupled. Lastly, we show how these recursions can be solved efficiently, with computational complexity comparable to that of the centralized problem.
SYJun 17, 2018
Optimal Local and Remote Controllers with Unreliable Uplink ChannelsSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
We consider a networked control system consisting of a remote controller and a collection of linear plants, each associated with a local controller. Each local controller directly observes the state of its co-located plant and can inform the remote controller of the plant's state through an unreliable uplink channel. We assume that the downlink channels from the remote controller to local controllers are perfect. The objective of the local controllers and the remote controller is to cooperatively minimize a quadratic performance cost. We provide a dynamic program for this decentralized control problem using the common information approach. Although our problem is not a partially nested problem, we obtain explicit optimal strategies for all controllers. In the optimal strategies, all controllers compute common estimates of the states of the plants based on the common information obtained from the communication network. The remote controller's action is linear in the common state estimates, and the action of each local controller is linear in both the actual state of its co-located plant and the common state estimates. We illustrate our results with numerical experiments using randomly generated models.
GTSep 17, 2012
Nash Equilibria for Stochastic Games with Asymmetric Information-Part 1: Finite GamesAshutosh Nayyar, Abhishek Gupta, Cédric Langbort et al.
A model of stochastic games where multiple controllers jointly control the evolution of the state of a dynamic system but have access to different information about the state and action processes is considered. The asymmetry of information among the controllers makes it difficult to compute or characterize Nash equilibria. Using common information among the controllers, the game with asymmetric information is shown to be equivalent to another game with symmetric information. Further, under certain conditions, a Markov state is identified for the equivalent symmetric information game and its Markov perfect equilibria are characterized. This characterization provides a backward induction algorithm to find Nash equilibria of the original game with asymmetric information in pure or behavioral strategies. Each step of this algorithm involves finding Bayesian Nash equilibria of a one-stage Bayesian game. The class of Nash equilibria of the original game that can be characterized in this backward manner are named common information based Markov perfect equilibria.
SYJun 18, 2018
Optimal Infinite Horizon Decentralized Networked Controllers with Unreliable CommunicationYi Ouyang, Seyed Mohammad Asghari, Ashutosh Nayyar
We consider a decentralized networked control system (DNCS) consisting of a remote controller and a collection of linear plants, each associated with a local controller. Each local controller directly observes the state of its co-located plant and can inform the remote controller of the plant's state through an unreliable uplink channel. The downlink channels from the remote controller to local controllers were assumed to be perfect. The objective of the local controllers and the remote controller is to cooperatively minimize the infinite horizon time average of expected quadratic cost. The finite horizon version of this problem was solved in our prior work [2]. The optimal strategies in the finite horizon case were shown to be characterized by coupled Riccati recursions. In this paper, we show that if the link failure probabilities are below certain critical thresholds, then the coupled Riccati recursions of the finite horizon solution reach a steady state and the corresponding decentralized strategies are optimal. Above these thresholds, we show that no strategy can achieve finite cost. We exploit a connection between our DNCS Riccati recursions and the coupled Riccati recursions of an auxiliary Markov jump linear system to obtain our results. Our main results in Theorems 1 and 2 explicitly identify the critical thresholds for the link failure probabilities and the optimal decentralized control strategies when all link failure probabilities are below their thresholds.
SYJun 23, 2016
Optimal Local and Remote Controllers with Unreliable CommunicationYi Ouyang, Seyed Mohammad Asghari, Ashutosh Nayyar
We consider a decentralized optimal control problem for a linear plant controlled by two controllers, a local controller and a remote controller. The local controller directly observes the state of the plant and can inform the remote controller of the plant state through a packet-drop channel. We assume that the remote controller is able to send acknowledgments to the local controller to signal the successful receipt of transmitted packets. The objective of the two controllers is to cooperatively minimize a quadratic performance cost. We provide a dynamic program for this decentralized control problem using the common information approach. Although our problem is not a partially nested LQG problem, we obtain explicit optimal strategies for the two controllers. In the optimal strategies, both controllers compute a common estimate of the plant state based on the common information. The remote controller's action is linear in the common estimated state, and the local controller's action is linear in both the actual state and the common estimated state.
SYNov 11, 2016
Dynamic Teams and Decentralized Control Problems with Substitutable ActionsSeyed Mohammad Asghari, Ashutosh Nayyar
This paper considers two problems -- a dynamic team problem and a decentralized control problem. The problems we consider do not belong to the known classes of "simpler" dynamic team/decentralized control problems such as partially nested or quadratically invariant problems. However, we show that our problems admit simple solutions under an assumption referred to as the substitutability assumption. Intuitively, substitutability in a team (resp. decentralized control) problem means that the effects of one team member's (resp. controller's) action on the cost function and the information (resp. state dynamics) can be achieved by an action of another member (resp. controller). For the non-partially-nested LQG dynamic team problem, it is shown that under certain conditions linear strategies are optimal. For the non-partially-nested decentralized LQG control problem, the state structure can be exploited to obtain optimal control strategies with recursively update-able sufficient statistics. These results suggest that substitutability can work as a counterpart of the information structure requirements that enable simplification of dynamic teams and decentralized control problems.
SYJan 10, 2016
Decentralized Control Problems with Substitutable ActionsSeyed Mohammad Asghari, Ashutosh Nayyar
We consider a decentralized system with multiple controllers and define substitutability of one controller by another in open-loop strategies. We explore the implications of this property on the optimization of closed-loop strategies. In particular, we focus on the decentralized LQG problem with substitutable actions. Even though the problem we formulate does not belong to the known classes of "simpler" decentralized problems such as partially nested or quadratically invariant problems, our results show that, under the substitutability assumption, linear strategies are optimal and we provide a complete state space characterization of optimal strategies. We also identify a family of information structures that all give the same optimal cost as the centralized information structure under the substitutability assumption. Our results suggest that open-loop substitutability can work as a counterpart of the information structure requirements that enable simplification of decentralized control problems.
SYFeb 9, 2019
Worst-case Guarantees for Remote Estimation of an Uncertain SourceMukul Gagrani, Yi Ouyang, Mohammad Rasouli et al.
Consider a remote estimation problem where a sensor wants to communicate the state of an uncertain source to a remote estimator over a finite time horizon. The uncertain source is modeled as an autoregressive process with bounded noise. Given that the sensor has a limited communication budget, the sensor must decide when to transmit the state to the estimator who has to produce real-time estimates of the source state. In this paper, we consider the problem of finding a scheduling strategy for the sensor and an estimation strategy for the estimator to jointly minimize the worst-case maximum instantaneous estimation error over the time horizon. This leads to a decentralized minimax decision-making problem. We obtain a complete characterization of optimal strategies for this decentralized minimax problem. In particular, we show that an open loop communication scheduling strategy is optimal and the optimal estimate depends only on the most recently received sensor observation.
LGAug 24, 2023
Conditional Kernel Imitation Learning for Continuous State EnvironmentsRishabh Agrawal, Nathan Dahlin, Rahul Jain et al.
Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most of RL, it does not assume availability of reward-feedback. Reward inference and shaping are known to be difficult and error-prone methods particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel conditional kernel density estimation-based imitation learning framework. It involves estimating the environment's transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
LGOct 16, 2023
Posterior Sampling-based Online Learning for Episodic POMDPsDengwang Tang, Dongze Ye, Rahul Jain et al.
Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
SYFeb 2, 2018
Decentralized Control of Stochastically Switched Linear System with Unreliable CommunicationSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
We consider a networked control system (NCS) consisting of two plants, a global plant and a local plant, and two controllers, a global controller and a local controller. The global (resp. local) plant follows discrete-time stochastically switched linear dynamics with a continuous global (resp. local) state and a discrete global (resp. local) mode. We assume that the state and mode of the global plant are observed by both controllers while the state and mode of the local plant are only observed by the local controller. The local controller can inform the global controller of the local plant's state and mode through an unreliable TCP-like communication channel where successful transmissions are acknowledged. The objective of the controllers is to cooperatively minimize a modes-dependent quadratic cost over a finite time horizon. Following the method developed in [1] and [2], we construct a dynamic program based on common information and a decomposition of strategies, and use it to obtain explicit optimal strategies for the controllers. In the optimal strategies, both controllers compute a common estimate of the local plant's state. The global controller's action is linear in the state of the global plant and the common estimated state, and the local controller's action is linear in the actual states of both plants and the common estimated state. Furthermore, the gain matrices for the global controller depend on the global mode and its observation about the local mode, while the gain matrices for the local controller depend on the actual modes of both plants and the global controller's observation about the local mode.
AIApr 10, 2023
A Novel Point-based Algorithm for Multi-agent Control Using the Common Information ApproachDengwang Tang, Ashutosh Nayyar, Rahul Jain
The Common Information (CI) approach provides a systematic way to transform a multi-agent stochastic control problem to a single-agent partially observed Markov decision problem (POMDP) called the coordinator's POMDP. However, such a POMDP can be hard to solve due to its extraordinarily large action space. We propose a new algorithm for multi-agent stochastic control problems, called coordinator's heuristic search value iteration (CHSVI), that combines the CI approach and point-based POMDP algorithms for large action spaces. We demonstrate the algorithm through optimally solving several benchmark problems.
LGMay 16
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models RevisitedRishabh Agrawal, Rahul Jain, Ashutosh Nayyar
Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.
LGAug 17, 2024
Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation LearningRishabh Agrawal, Nathan Dahlin, Rahul Jain et al.
Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert's dataset nor any information about the transition dynamics. Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.
SYMay 10
Action Recommendations for Sequentially Rational Strategic AgentsRenyan Sun, Ashutosh Nayyar
We consider a finite-horizon discrete-time dynamic system that is jointly controlled by two strategic agents. There is a system designer that has its own reward function but does not have direct control over the agents' actions. We consider an information structure where the current state and all past history are equally accessible by the designer and the agents. The designer sends action recommendations to the agents at each time step. Each agent can use the received recommendation and the available information to choose its action. We are interested in the setting where the designer would like to send recommendations in a way that incentivizes the agents to adopt obedient strategies, i.e., to take the action recommended by the designer. Our goal is to find an optimal action recommendation strategy for the designer that maximizes the designer's objective while ensuring that obedient strategies are \emph{sequentially rational} for the agents. We provide an algorithm for the designer's problem that involves solving a family of linear programs in a backward inductive manner.
LGMar 21
Bayesian Learning in Episodic Zero-Sum GamesChang-Wei Yueh, Andy Zhao, Ashutosh Nayyar et al.
We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) Both players use the posterior sampling algorithm, and (ii) Only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$ where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the action space sizes of the two players. Experiments in a grid-world predator--prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.
LGNov 11, 2025
Balance Equation-based Distributionally Robust Offline Imitation LearningRishabh Agrawal, Yusuf Alvi, Rahul Jain et al.
Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds where modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all induce shifts in transition dynamics, leading to severe performance degradation. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.
OCFeb 13, 2024
Model approximation in MDPs with unbounded per-step costBerk Bozkurt, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hatπ^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hatπ^\star $ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
LGMay 23, 2024
Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed BudgetDengwang Tang, Rahul Jain, Ashutosh Nayyar et al.
In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criteria based on linear programming theory to identify the optimal support. We provide a theoretical upper bound on the mis-identification (of the the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$ and some constants that characterize the hardness of the problem instance. We also develop an information theoretic lower bound on the error probability that shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances.
AIMay 24, 2023
Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision ProcessesKrishna C. Kalagarla, Dhruva Kartik, Dongming Shen et al.
Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
LGSep 8, 2021
A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary OpponentMehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves Bayesian regret bound of $O(HS\sqrt{AT})$ in the infinite-horizon zero-sum stochastic games with average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent can not be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.
SYAug 19, 2021
A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systemsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in~$T$.
SYAug 18, 2021
Scalable regret for learning to control network-coupled subsystems with unknown dynamicsSagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e. loss in performance) of our strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems globally and directly using existing LQG learning algorithms for the global system results in a regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network. We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}} \big( n \sqrt{T} \big)$ where $n$ is the number of subsystems, $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$ and $T$. Thus, the regret scales linearly with the number of subsystems. We present numerical experiments to illustrate the salient features of the proposed algorithm.
LGFeb 25, 2021
Online Learning for Unknown Partially Observable MDPsMehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively tradeoff exploration and exploitation, is even harder, and no solution exists currently. In this paper, we consider infinite-horizon average-cost POMDPs with unknown transition model, though a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of $O(\log T)$, where $T$ is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves $O (T^{2/3})$ regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs and has sub-linear regret.
SYNov 9, 2020
Thompson sampling for linear quadratic mean-field teamsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents through the mean-field (i.e., empirical mean) of the states and controls. Directly using single-agent LQ learning algorithms in such models results in regret which increases polynomially with the number of agents. We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$ irrespective of the total number of agents, where the $\tilde{\mathcal{O}}$ notation hides logarithmic factors in $T$. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm.
LGJan 27, 2020
Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical SystemsSeyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL) primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each system controlled by an agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication among the agents, we show that no learning policy can generate sub-linear in $T$ regret, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism among the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound. (Here $\tilde{O}(\cdot)$ hides constants and logarithmic factors). Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
MLDec 4, 2018
Sequential Experiment Design for Hypothesis VerificationDhruva Kartik, Ashutosh Nayyar, Urbashi Mitra
Hypothesis testing is an important problem with applications in target localization, clinical trials etc. Many active hypothesis testing strategies operate in two phases: an exploration phase and a verification phase. In the exploration phase, selection of experiments is such that a moderate level of confidence on the true hypothesis is achieved. Subsequent experiment design aims at improving the confidence level on this hypothesis to the desired level. In this paper, the focus is on the verification phase. A confidence measure is defined and active hypothesis testing is formulated as a confidence maximization problem in an infinite-horizon average-reward Partially Observable Markov Decision Process (POMDP) setting. The problem of maximizing confidence conditioned on a particular hypothesis is referred to as the hypothesis verification problem. The relationship between hypothesis testing and verification problems is established. The verification problem can be formulated as a Markov Decision Process (MDP). Optimal solutions for the verification MDP are characterized and a simple heuristic adaptive strategy for verification is proposed based on a zero-sum game interpretation of Kullback-Leibler divergences. It is demonstrated through numerical experiments that the heuristic performs better in some scenarios compared to existing methods in literature.
LGSep 14, 2017
Learning Unknown Markov Decision Processes: A Thompson Sampling ApproachYi Ouyang, Mukul Gagrani, Ashutosh Nayyar et al.
We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion happens when the number of visits to any state-action pair is doubled. We establish $\tilde O(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.