LGAISYJul 19, 2022

Actor-Critic based Improper Reinforcement Learning

arXiv:2207.09090v14 citationsh-index: 81
Originality Incremental advance
AI Analysis

This addresses the problem of efficiently tuning controllers for target environments with few trials, which is incremental as it builds on existing actor-critic methods.

The paper tackles the problem of improper reinforcement learning by optimally combining given base controllers to outperform each individually, using actor-critic methods, and shows that the proposed algorithm can stabilize systems even with unstable base policies in benchmarks like cartpole and queueing tasks.

We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes