LG | Apr 20, 2022

Learning to Constrain Policy Optimization with Virtual Trust Region

arXiv:2204.09315v2 | 5 citations | h-index: 58
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of stable policy optimization in reinforcement learning, particularly when the old policy performs poorly. It appears incremental, since it extends existing constrained policy gradient methods.

The paper tackles the problem of constrained optimization in policy gradient reinforcement learning by introducing a virtual trust region that regulates policy updates using a memory of past policies, resulting in competitive performance across robotic locomotion, sparse-reward navigation, and Atari games.

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
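The two-trust-region idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the virtual policy is built from memory or how the constraints are enforced. The sketch assumes discrete action distributions at a single state, a virtual policy formed as a weighted mixture of remembered policies, and KL penalties with hypothetical coefficients `beta_old` and `beta_virtual`. All function names are illustrative.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def virtual_policy(memory, weights):
    """Weighted mixture of past policies (assumed form, not from the paper)."""
    total = sum(weights)
    n_actions = len(memory[0])
    return [sum(w * pol[a] for w, pol in zip(weights, memory)) / total
            for a in range(n_actions)]

def penalized_objective(surrogate, new, old, virtual,
                        beta_old=1.0, beta_virtual=1.0):
    """Surrogate gain minus penalties for leaving either trust region."""
    return (surrogate
            - beta_old * kl(old, new)           # usual trust region
            - beta_virtual * kl(virtual, new))  # virtual trust region

# Memory of past policies' action probabilities at one state (made-up values).
memory = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
old = memory[-1]                                  # most recent policy
virtual = virtual_policy(memory, weights=[1.0, 1.0, 1.0])
new = [0.4, 0.3, 0.3]                             # candidate updated policy
score = penalized_objective(0.05, new, old, virtual)
```

In this sketch the weights are fixed and uniform. The abstract describes the virtual policy as being built automatically during optimization, so a learned weighting would take the place of the `weights` argument.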

