LG · AI · Nov 1, 2024

Beyond the Boundaries of Proximal Policy Optimization

arXiv:2411.00666v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work is aimed at reinforcement learning researchers and practitioners. It offers incremental improvements to PPO by challenging implicit design choices in the algorithm.

The authors proposed outer-PPO, a framework that decouples update estimation and application in Proximal Policy Optimization, allowing the use of arbitrary gradient-based optimizers. Empirical results showed that non-unity learning rates and momentum achieved statistically significant improvements on Brax and Jumanji environments compared to a tuned PPO baseline.

Proximal policy optimization (PPO) is a widely used algorithm for on-policy reinforcement learning. This work offers an alternative perspective on PPO, in which it is decomposed into the inner-loop estimation of update vectors and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight, we propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular, we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
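
The inner/outer decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `outer_step` and `toy_inner_loop`, the flat-list parameter representation, and the heavy-ball form of momentum are all assumptions made for illustration. The inner loop here is a stand-in that simply moves the parameters halfway toward a fixed target; in real PPO it would be the usual epochs of clipped-surrogate minibatch updates.

```python
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical target used only by the toy inner loop below.
TARGET: Vector = [1.0, -2.0]


def toy_inner_loop(params: Vector) -> Vector:
    """Stand-in for PPO's inner loop (assumption, not real PPO).

    Returns the parameters the inner loop would end at when started
    from `params`; here it just moves halfway toward TARGET.
    """
    return [p + 0.5 * (t - p) for p, t in zip(params, TARGET)]


def outer_step(
    params: Vector,
    inner_loop: Callable[[Vector], Vector],
    velocity: Vector,
    outer_lr: float = 1.0,
    momentum: float = 0.0,
) -> Tuple[Vector, Vector]:
    """One outer iteration: estimate an update vector, then apply it.

    With outer_lr=1.0 and momentum=0.0 the result equals the inner
    loop's output, which corresponds to the standard PPO behaviour
    described in the abstract.
    """
    # Inner loop: estimate where the policy parameters would move to.
    proposed = inner_loop(params)
    # Update vector: difference between proposed and current parameters.
    update = [q - p for q, p in zip(proposed, params)]
    # Outer loop: heavy-ball momentum (an assumed form) plus a learning rate.
    velocity = [momentum * v + u for v, u in zip(velocity, update)]
    new_params = [p + outer_lr * v for p, v in zip(params, velocity)]
    return new_params, velocity


if __name__ == "__main__":
    start = [0.0, 0.0]
    zero = [0.0, 0.0]
    # Unity learning rate, no momentum: identical to the inner-loop output.
    print(outer_step(start, toy_inner_loop, zero))
    # Non-unity outer learning rate scales the same update vector.
    print(outer_step(start, toy_inner_loop, zero, outer_lr=2.0))
```

Treating the update vector as a gradient-like quantity is what lets the outer step be swapped for any gradient-based optimizer; the learning rate and momentum shown here are the two outer-loop variants the abstract names.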

