LG · AI · ML · Aug 12, 2021

A general class of surrogate functions for stable and efficient reinforcement learning

arXiv:2108.05828v5 · 21 citations
Originality: Incremental advance
AI Analysis

This work targets unstable and inefficient policy gradient reinforcement learning by providing a theoretically grounded framework for constructing surrogate functions. The advance is incremental: it builds on existing methods such as TRPO and PPO.

The authors address the lack of theoretical guarantees for the surrogate functions used in policy gradient methods. They propose a general framework (FMA-PG) that generates surrogate functions with policy improvement guarantees, independent of the policy parameterization, and report improved robustness and efficiency for a PPO variant on the MuJoCo suite.

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
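
To make the idea of maximizing a sequence of surrogate functions concrete, here is a minimal sketch of mirror ascent on a toy bandit. It is not the paper's implementation. It assumes a tabular policy stored directly as a probability vector, exactly known mean rewards, and a KL proximity term (the negative-entropy mirror map); the names `surrogate`, `mirror_ascent_step`, `run`, and the step size `eta` are hypothetical. Under these assumptions the surrogate has a closed-form maximizer, the exponentiated-gradient update.

```python
import numpy as np


def surrogate(pi, pi_t, rewards, eta):
    """Linearization of J around pi_t minus a KL proximity term.

    Toy setting (assumption): J(pi) = sum_a pi(a) * rewards(a),
    so the gradient of J with respect to pi is just `rewards`.
    """
    kl = np.sum(pi * (np.log(pi) - np.log(pi_t)))
    return pi_t @ rewards + rewards @ (pi - pi_t) - kl / eta


def mirror_ascent_step(pi_t, rewards, eta):
    """Closed-form maximizer of `surrogate` over the probability simplex."""
    logits = np.log(pi_t) + eta * rewards
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def run(rewards, eta=0.5, steps=200):
    """Iterate the update from a uniform policy and record J after each step."""
    rewards = np.asarray(rewards, dtype=float)
    pi = np.full(len(rewards), 1.0 / len(rewards))
    values = [float(pi @ rewards)]
    for _ in range(steps):
        pi = mirror_ascent_step(pi, rewards, eta)
        values.append(float(pi @ rewards))
    return pi, values


if __name__ == "__main__":
    pi, values = run([0.1, 0.5, 0.9])
    print("final policy:", np.round(pi, 4))
    print("expected reward, first and last:", values[0], values[-1])
```

In this toy case the expected reward never decreases from one iterate to the next: the surrogate evaluated at the old policy equals J at the old policy, the maximizer can only raise the surrogate, and because J is linear here the surrogate lower-bounds J. Whether a similar chain holds in a general MDP depends on the conditions established in the paper.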

Code Implementations: 1 repo
