LG | ML | Sep 23, 2020

Revisiting Design Choices in Proximal Policy Optimization

arXiv:2009.10897v1 | 66 citations
Originality: Incremental advance
AI Analysis

This work highlights that common PPO design choices are environment-specific, cautioning against their uncritical adoption in broader reinforcement learning applications.

The authors identified three failure modes of standard Proximal Policy Optimization (PPO) when applied outside typical benchmarks, showing that alternative design choices for surrogate objectives and policy parameterizations can prevent these issues.

Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks. We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO. We explain why standard design choices are problematic in these cases, and show that alternative choices of surrogate objectives and policy parameterizations can prevent the failure modes. We hope that our work serves as a reminder that many algorithmic design choices in reinforcement learning are tied to specific simulation environments. We should not implicitly accept these choices as a standard part of a more general algorithm.
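For readers unfamiliar with the "clipped probability ratios" the abstract refers to, the following is a minimal NumPy sketch of the widely used PPO clipped surrogate objective. It illustrates the standard design choice the paper revisits, not the alternative objectives the authors propose. The function name, argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration and are not taken from the paper or its code.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    All names and the default clip_eps are illustrative assumptions.
    logp_new / logp_old: log-probabilities of the taken actions under
    the current policy and the data-collecting policy, respectively.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the elementwise minimum removes the incentive to push the
    # ratio outside [1 - clip_eps, 1 + clip_eps] in the improving direction.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the two policies coincide, the ratio is 1 everywhere and the objective reduces to the mean advantage. Once the ratio leaves the clipping interval in the direction that would increase the objective, the clipped term takes over and its gradient with respect to the policy vanishes for that sample.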

Code Implementations: 1 repo