LGJul 19, 2022
Actor-Critic based Improper Reinforcement LearningMohammadi Zaki, Avinash Mohan, Aditya Gopalan et al.
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
NIMay 24, 2021
A Low-Delay MAC for IoT Applications: Decentralized Optimal Scheduling of Queues without Explicit State Information SharingAvinash Mohan, Arpan Chattopadhyay, Shivam Vinayak Vatsa et al.
We consider a system of several collocated nodes sharing a time slotted wireless channel, and seek a MAC (medium access control) that (i) provides low mean delay, (ii) has distributed control (i.e., there is no central scheduler), and (iii) does not require explicit exchange of state information or control signals. The design of such MAC protocols must keep in mind the need for contention access at light traffic, and scheduled access in heavy traffic, leading to the long-standing interest in hybrid, adaptive MACs. Working in the discrete time setting, for the distributed MAC design, we consider a practical information structure where each node has local information and some common information obtained from overhearing. In this setting, "ZMAC" is an existing protocol that is hybrid and adaptive. We approach the problem via two steps (1) We show that it is sufficient for the policy to be "greedy" and "exhaustive". Limiting the policy to this class reduces the problem to obtaining a queue switching policy at queue emptiness instants. (2) Formulating the delay optimal scheduling as a POMDP (partially observed Markov decision process), we show that the optimal switching rule is Stochastic Largest Queue (SLQ). Using this theory as the basis, we then develop a practical distributed scheduler, QZMAC, which is also tunable. We implement QZMAC on standard off-the-shelf TelosB motes and also use simulations to compare QZMAC with the full-knowledge centralized scheduler, and with ZMAC. We use our implementation to study the impact of false detection while overhearing the common information, and the efficiency of QZMAC. Our simulation results show that the mean delay with QZMAC is close that of the full-knowledge centralized scheduler.
LGFeb 16, 2021
Improper Reinforcement Learning with Gradient-based Policy OptimizationMohammadi Zaki, Avinash Mohan, Aditya Gopalan et al.
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. \par We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. We derive convergence rate guarantees for the approach assuming access to a gradient oracle. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. Numerical results on (i) the standard control theoretic benchmark of stabilizing an inverted pendulum and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable\footnote{Under review. Please do not distribute.}.
LGNov 5, 2019
Towards Optimal and Efficient Best Arm Identification in Linear BanditsMohammadi Zaki, Avinash Mohan, Aditya Gopalan
We give a new algorithm for best arm identification in linearly parameterised bandits in the fixed confidence setting. The algorithm generalises the well-known LUCB algorithm of Kalyanakrishnan et al. (2012) by playing an arm which minimises a suitable notion of geometric overlap of the statistical confidence set for the unknown parameter, and is fully adaptive and computationally efficient as compared to several state-of-the methods. We theoretically analyse the sample complexity of the algorithm for problems with two and three arms, showing optimality in many cases. Numerical results indicate favourable performance over other algorithms with which we compare.