SYSep 8, 2012
Decentralized Stochastic Control with Partial History Sharing: A Common Information ApproachAshutosh Nayyar, Aditya Mahajan, Demosthenis Teneketzis
A general model of decentralized stochastic control called partial history sharing information structure is presented. In this model, at each step the controllers share part of their observation and control history with each other. This general model subsumes several existing models of information sharing as special cases. Based on the information commonly known to all the controllers, the decentralized problem is reformulated as an equivalent centralized problem from the perspective of a coordinator. The coordinator knows the common information and select prescriptions that map each controller's local information to its control actions. The optimal control problem at the coordinator is shown to be a partially observable Markov decision process (POMDP) which is solved using techniques from Markov decision theory. This approach provides (a) structural results for optimal strategies, and (b) a dynamic program for obtaining optimal strategies for all controllers in the original decentralized problem. Thus, this approach unifies the various ad-hoc approaches taken in the literature. In addition, the structural results on optimal control strategies obtained by the proposed approach cannot be obtained by the existing generic approach (the person-by-person approach) for obtaining structural results in decentralized problems; and the dynamic program obtained by the proposed approach is simpler than that obtained by the existing generic approach (the designer's approach) for obtaining dynamic programs in decentralized problems.
SYJul 25, 2018
Remote estimation over a packet-drop channel with Markovian stateJhelum Chakravorty, Aditya Mahajan
We investigate a remote estimation problem in which a transmitter observes a Markov source and chooses the power level to transmit it over a time-varying packet-drop channel. The channel is modeled as a channel with Markovian state where the packet drop probability depends on the channel state and the transmit power. A receiver observes the channel output and the channel state and estimates the source realization. The receiver also feeds back the channel state and an acknowledgment for successful reception to the transmitter. We consider two models for the source---finite state Markov chains and first-order autoregressive processes. For the first model, using ideas from team theory, we establish the structure of optimal transmission and estimation strategies and identify a dynamic program to determine optimal strategies with that structure. For the second model, we assume that the noise process has unimodal and symmetric distribution. Using ideas from majorization theory, we show that the optimal transmission strategy is symmetric and monotonic and the optimal estimation strategy is like Kalman filter. Consequently, when there are a finite number of power levels, the optimal transmission strategy may be described using thresholds that depend on the channel state. Finally, we propose a simulation based approach (Renewal Monte Carlo) to compute the optimal thresholds and optimal performance and elucidate the algorithm with an example.
SYAug 7, 2012
Optimal decentralized control of coupled subsystems with control sharingAditya Mahajan
Subsystems that are coupled due to dynamics and costs arise naturally in various communication applications. In many such applications the control actions are shared between different control stations giving rise to a \emph{control sharing} information structure. Previous studies of control-sharing have concentrated on the linear quadratic Gaussian setup and a solution approach tailored to continuous valued control actions. In this paper a three step solution approach for finite valued control actions is presented. In the first step, a person-by-person approach is used to identify redundant data or a sufficient statistic for local information at each control station. In the second step, the common-information based approach of Nayyar et al.\ (2011) is used to find a sufficient statistic for the common information shared between all control stations and to obtain a dynamic programming decomposition. In the third step, the specifics of the model are used to simplify the sufficient statistic and the dynamic program. As an example, an exact solution of a two-user multiple access broadcast system is presented.
OCJun 11, 2016
Fundamental limits of remote estimation of autoregressive Markov processes under communication constraintsJhelum Chakravorty, Aditya Mahajan
The fundamental limits of remote estimation of Markov processes under communication constraints are presented. The remote estimation system consists of a sensor and an estimator. The sensor observes a discrete-time Markov process, which is a symmetric countable state Markov source or a Gauss-Markov process. At each time, the sensor either transmits the current state of the Markov process or does not transmit at all. Communication is noiseless but costly. The estimator estimates the Markov process based on the transmitted observations. In such a system, there is a trade-off between communication cost and estimation accuracy. Two fundamental limits of this trade-off are characterized for infinite horizon discounted cost and average cost setups. First, when each transmission is costly, we characterize the minimum achievable cost of communication plus estimation error. Second, when there is a constraint on the average number of transmissions, we characterize the minimum achievable estimation error. Transmission and estimation strategies that achieve these fundamental limits are also identified.
LGNov 6, 2022
On learning history based policies for controlling Markov decision processesGandharv Patil, Aditya Mahajan, Doina Precup
Reinforcementlearning(RL)folkloresuggeststhathistory-basedfunctionapproximationmethods,suchas recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a Partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.
LGFeb 6, 2023
Dealing With Non-stationarity in Decentralized Cooperative Multi-Agent Deep Reinforcement Learning via Multi-Timescale LearningHadi Nekoei, Akilesh Badrinaaraayanan, Amit Sinha et al.
Decentralized cooperative multi-agent deep reinforcement learning (MARL) can be a versatile learning framework, particularly in scenarios where centralized training is either not possible or not practical. One of the critical challenges in decentralized deep MARL is the non-stationarity of the learning environment when multiple agents are learning concurrently. A commonly used and efficient scheme for decentralized MARL is independent learning in which agents concurrently update their policies independently of each other. We first show that independent learning does not always converge, while sequential learning where agents update their policies one after another in a sequence is guaranteed to converge to an agent-by-agent optimal solution. In sequential learning, when one agent updates its policy, all other agent's policies are kept fixed, alleviating the challenge of non-stationarity due to simultaneous updates in other agents' policies. However, it can be slow because only one agent is learning at any time. Therefore it might also not always be practical. In this work, we propose a decentralized cooperative MARL algorithm based on multi-timescale learning. In multi-timescale learning, all agents learn simultaneously, but at different learning rates. In our proposed method, when one agent updates its policy, other agents are allowed to update their policies as well, but at a slower rate. This speeds up sequential learning, while also minimizing non-stationarity caused by other agents updating concurrently. Multi-timescale learning outperforms state-of-the-art decentralized learning methods on a set of challenging multi-agent cooperative tasks in the epymarl(Papoudakis et al., 2020) benchmark. This can be seen as a first step towards more general decentralized cooperative deep MARL methods based on multi-timescale learning.
SYSep 24, 2024
Agent-state based policies in POMDPs: Beyond belief-state MDPsAmit Sinha, Aditya Mahajan
The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable agent state and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer's approach to find optimal non-stationary agent-state based policies, policy search approaches to find a locally optimal stationary agent-state based policies, and the approximate information state to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
LGJun 9, 2023
Approximate information state based convergence analysis of recurrent Q-learningErfan Seyedsalehi, Nima Akbarzadeh, Amit Sinha et al.
In spite of the large literature on reinforcement learning (RL) algorithms for partially observable Markov decision processes (POMDPs), a complete theoretical understanding is still lacking. In a partially observable setting, the history of data available to the agent increases over time so most practical algorithms either truncate the history to a finite window or compress it using a recurrent neural network leading to an agent state that is non-Markovian. In this paper, it is shown that in spite of the lack of the Markov property, recurrent Q-learning (RQL) converges in the tabular setting. Moreover, it is shown that the quality of the converged limit depends on the quality of the representation which is quantified in terms of what is known as an approximate information state (AIS). Based on this characterization of the approximation error, a variant of RQL with AIS losses is presented. This variant performs better than a strong baseline for RQL that does not use AIS losses. It is demonstrated that there is a strong correlation between the performance of RQL over time and the loss associated with the AIS representation.
LGJul 8, 2024
Periodic agent-state based Q-learning for POMDPsAmit Sinha, Matthieu Geist, Aditya Mahajan
The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis that we illustrate via examples is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
61.4LGMar 18
Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded CostsAbhishek Gupta, Aditya Mahajan
Markov decision processes (MDPs) is viewed as an optimization of an objective function over certain linear operators over general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalization of many well-known results in reinforcement learning to cases with generate state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.
LGJan 17, 2024
Bridging State and History Representations: Understanding Self-Predictive RLTianwei Ni, Benjamin Eysenbach, Erfan Seyedsalehi et al.
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
OCFeb 13, 2024
Model approximation in MDPs with unbounded per-step costBerk Bozkurt, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hatπ^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hatπ^\star $ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
20.4MAApr 10
Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergenceAmit Sinha, Matthieu Geist, Aditya Mahajan
Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.
SYApr 5, 2025
Task load dependent decision referrals for joint binary classification in human-automation teamsKesav Kaza, Jerome Le Ny, Aditya Mahajan
We consider the problem of optimal decision referrals in human-automation teams performing binary classification tasks. The automation, which includes a pre-trained classifier, observes data for a batch of independent tasks, analyzes them, and may refer a subset of tasks to a human operator for fresh and final analysis. Our key modeling assumption is that human performance degrades with task load. We model the problem of choosing which tasks to refer as a stochastic optimization problem and show that, for a given task load, it is optimal to myopically refer tasks that yield the largest reduction in expected cost, conditional on the observed data. This provides a ranking scheme and a policy to determine the optimal set of tasks for referral. We evaluate this policy against a baseline through an experimental study with human participants. Using a radar screen simulator, participants made binary target classification decisions under time constraint. They were guided by a decision rule provided to them, but were still prone to errors under time pressure. An initial experiment estimated human performance model parameters, while a second experiment compared two referral policies. Results show statistically significant gains for the proposed optimal referral policy over a blind policy that determines referrals using the automation and human-performance models but not based on the observed data.
LGJan 31, 2025
A Theoretical Justification for Asymmetric Actor-Critic AlgorithmsGaspard Lambrechts, Damien Ernst, Aditya Mahajan
In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.
LGAug 29, 2025
Convergence of regularized agent-state-based Q-learning in POMDPsAmit Sinha, Matthieu Geist, Aditya Mahajan
In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i)~the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii)~policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
LGNov 27, 2024
Concentration of Cumulative Reward in Markov Decision ProcessesBorna Sayedana, Peter E. Caines, Aditya Mahajan
In this paper, we investigate the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both infinite-horizon settings (i.e., average and discounted reward frameworks) and finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of iterated logarithms, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of iterated logarithms. Additionally, we explore two key implications of our results. First, we analyze the sample path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a novel martingale decomposition of cumulative rewards, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.
LGFeb 7, 2022
On learning Whittle index policy for restless bandits with scalable regretNima Akbarzadeh, Aditya Mahajan
Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as $\tilde O(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state space, $\mathsf{A}$ is the size of the action space, $T$ is the horizon, and the $\tilde{O}(\cdot)$ notation hides logarithmic terms. Due to the linear dependence on the size of the state space, these regret bounds are prohibitively large for resource allocation and scheduling problems. In this paper, we present a model-based RL algorithm for such problems which has scalable regret. In particular, we consider a restless bandit model, and propose a Thompson-sampling based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$ or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{O}(n^{1.5} \sqrt{T})$ or $\tilde{O}(\max\{m\sqrt{n}, n\} \sqrt{T})$. We present numerical examples to illustrate the salient features of the algorithm.
LGDec 20, 2021
Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear SystemsBorna Sayedana, Mohammad Afshari, Peter E. Caines et al.
In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is $\mathcal{O}\big(\sqrt{\log(T)/T} \big)$ where $T$ is the time horizon. These results show that switched least squares method for MJS has the same rate of convergence as least squares method for autonomous linear systems. We derive our results by imposing a general stability assumption on the model called stability in the average sense. We show that stability in the average sense is a weaker form of stability compared to the stability assumptions commonly imposed in the literature. We present numerical examples to illustrate the performance of the proposed method.
RONov 11, 2021
Scalable Operator Allocation for Multi-Robot Assistance: A Restless Bandit ApproachAbhinav Dahiya, Nima Akbarzadeh, Aditya Mahajan et al.
In this paper, we consider the problem of allocating human operators in a system with multiple semi-autonomous robots. Each robot is required to perform an independent sequence of tasks, subjected to a chance of failing and getting stuck in a fault state at every task. If and when required, a human operator can assist or teleoperate a robot. Conventional MDP techniques used to solve such problems face scalability issues due to exponential growth of state and action spaces with the number of robots and operators. In this paper we derive conditions under which the operator allocation problem is indexable, enabling the use of the Whittle index heuristic. The conditions can be easily checked to verify indexability, and we show that they hold for a wide range of problems of interest. Our key insight is to leverage the structure of the value function of individual robots, resulting in conditions that can be verified separately for each state of each robot. We apply these conditions to two types of transitions commonly seen in remote robot supervision systems. Through numerical simulations, we demonstrate the efficacy of Whittle index policy as a near-optimal and scalable approach that outperforms existing scalable methods.
SYAug 19, 2021
A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systemsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in~$T$.
SYAug 18, 2021
Scalable regret for learning to control network-coupled subsystems with unknown dynamicsSagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar et al.
We consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e. loss in performance) of our strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems globally and directly using existing LQG learning algorithms for the global system results in a regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network. We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}} \big( n \sqrt{T} \big)$ where $n$ is the number of subsystems, $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$ and $T$. Thus, the regret scales linearly with the number of subsystems. We present numerical experiments to illustrate the salient features of the proposed algorithm.
NIJun 29, 2021
Structure-aware reinforcement learning for node-overload protection in mobile edge computingAnirudha Jitani, Aditya Mahajan, Zhongwen Zhu et al.
Mobile Edge Computing (MEC) refers to the concept of placing computational capability and applications at the edge of the network, providing benefits such as reduced latency in handling client requests, reduced network congestion, and improved performance of applications. The performance and reliability of MEC are degraded significantly when one or several edge servers in the cluster are overloaded. Especially when a server crashes due to the overload, it causes service failures in MEC. In this work, an adaptive admission control policy to prevent edge node from getting overloaded is presented. This approach is based on a recently-proposed low complexity RL (Reinforcement Learning) algorithm called SALMUT (Structure-Aware Learning for Multiple Thresholds), which exploits the structure of the optimal admission control policy in multi-class queues for an average-cost setting. We extend the framework to work for node overload-protection problem in a discounted-cost setting. The proposed solution is validated using several scenarios mimicking real-world deployments in two different settings - computer simulations and a docker testbed. Our empirical evaluations show that the total discounted cost incurred by SALMUT is similar to state-of-the-art deep RL algorithms such as PPO (Proximal Policy Optimization) and A2C (Advantage Actor Critic) but requires an order of magnitude less time to train, outputs easily interpretable policy, and can be deployed in an online manner.
SYNov 9, 2020
Thompson sampling for linear quadratic mean-field teamsMukul Gagrani, Sagar Sudhakara, Aditya Mahajan et al.
We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents through the mean-field (i.e., empirical mean) of the states and controls. Directly using single-agent LQ learning algorithms in such models results in regret which increases polynomially with the number of agents. We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$ irrespective of the total number of agents, where the $\tilde{\mathcal{O}}$ notation hides logarithmic factors in $T$. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm.
LGOct 17, 2020
Approximate information state for approximate planning and reinforcement learning in partially observed systemsJayakumar Subramanian, Amit Sinha, Raihan Seraj et al.
We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two equivalent definitions of information state -- i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms. and detailed numerical experiments with low, moderate and high dimensional environments.
SYApr 24, 2020
Decentralized linear quadratic systems with major and minor agents and non-Gaussian noiseMohammad Afshari, Aditya Mahajan
A decentralized linear quadratic system with a major agent and a collection of minor agents is considered. The major agent affects the minor agents, but not vice versa. The state of the major agent is observed by all agents. In addition, the minor agents have a noisy observation of their local state. The noise processes is \emph{not} assumed to be Gaussian. The structures of the optimal strategy and the best linear strategy are characterized. It is shown that major agent's optimal control action is a linear function of the major agent's MMSE (minimum mean squared error) estimate of the system state while the minor agent's optimal control action is a linear function of the major agent's MMSE estimate of the system state and a "correction term" which depends on the difference of the minor agent's MMSE estimate of its local state and the major agent's MMSE estimate of the minor agent's local state. Since the noise is non-Gaussian, the minor agent's MMSE estimate is a non-linear function of its observation. It is shown that replacing the minor agent's MMSE estimate by its LLMS (linear least mean square) estimate gives the best linear control strategy. The results are proved using a direct method based on conditional independence, common-information-based splitting of state and control actions, and simplifying the per-step cost based on conditional independence, orthogonality principle, and completion of squares.
SYSep 18, 2018
Linear Quadratic Mean Field Teams: Optimal and Approximately Optimal Decentralized SolutionsJalal Arabneydi, Aditya Mahajan
We consider team optimal control of decentralized systems with linear dynamics, quadratic costs, and arbitrary disturbance that consist of multiple sub-populations with exchangeable agents (i.e., exchanging two agents within the same sub-population does not affect the dynamics or the cost). Such a system is equivalent to one where the dynamics and costs are coupled across agents through the mean-field (or empirical mean) of the states and actions (even when the primitive random variables are non-exchangeable). Two information structures are investigated. In the first, all agents observe their local state and the mean-field of all sub-populations, in the second, all agents observe their local state but the mean-field of only a subset of the sub-populations. Both information structures are non-classical and not partially nested. Nonetheless, it is shown that linear control strategies are optimal for the first and approximately optimal for the second, the approximation error is inversely proportional to the size of the sub-populations whose mean-fields are not observed. The corresponding gains are determined by the solution of K+1 decoupled standard Riccati equations, where K is the number of sub-populations. The dimensions of the Riccati equations do not depend on the size of the sub-populations, thus the solution complexity is independent of the number of agents. Generalizations to major-minor agents, tracking cost, weighted mean-field, and infinite horizon are provided. The results are illustrated using an example of demand response in smart grids.
LGApr 3, 2018
Renewal Monte Carlo: Renewal theory based reinforcement learningJayakumar Subramanian, Aditya Mahajan
In this paper, we present an online reinforcement learning algorithm, called Renewal Monte Carlo (RMC), for infinite horizon Markov decision processes with a designated start state. RMC is a Monte Carlo algorithm and retains the advantages of Monte Carlo methods including low bias, simplicity, and ease of implementation while, at the same time, circumvents their key drawbacks of high variance and delayed (end of episode) updates. The key ideas behind RMC are as follows. First, under any reasonable policy, the reward process is ergodic. So, by renewal theory, the performance of a policy is equal to the ratio of expected discounted reward to the expected discounted time over a regenerative cycle. Second, by carefully examining the expression for performance gradient, we propose a stochastic approximation algorithm that only requires estimates of the expected discounted reward and discounted time over a regenerative cycle and their gradients. We propose two unbiased estimators for evaluating performance gradients---a likelihood ratio based estimator and a simultaneous perturbation based estimator---and show that for both estimators, RMC converges to a locally optimal policy. We generalize the RMC algorithm to post-decision state models and also present a variant that converges faster to an approximately optimal policy. We conclude by presenting numerical experiments on a randomly generated MDP, event-triggered communication, and inventory management.
CYApr 17, 2013
Forensic Analysis of Instant Messenger Applications on Android DevicesAditya Mahajan, M. S. Dahiya, H. P. Sanghvi
This paper focuses on conducting forensic data analysis of 2 widely used IMs applications on Android phones WhatsApp and Viber. The tests and analysis were performed with the aim of determining what data and information can be found on the devices internal memory for instant messengers eg chat messaging logs and history send & received image or video files etc. The experiments and results show that heavy amount of potential evidences and valuable data can be found on Android phones by forensic investigators.