PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation
This work addresses a specific bottleneck in reinforcement learning for continuous control tasks, offering an incremental improvement over PPO.
The authors address premature shrinkage of the exploration variance in Proximal Policy Optimization (PPO) for continuous action spaces, which slows progress and can leave the algorithm stuck in local optima. They propose PPO-CMA, which adaptively expands the exploration variance, improves performance on Roboschool continuous control benchmarks, and is less sensitive to hyperparameter choices.
Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach. However, we observe that in a continuous action space, PPO can prematurely shrink the exploration variance, which leads to slow progress and may make the algorithm prone to getting stuck in local optima. Drawing inspiration from CMA-ES, a black-box evolutionary optimization method designed for robustness in similar situations, we propose PPO-CMA, a proximal policy optimization approach that adaptively expands the exploration variance to speed up progress. With only minor changes to PPO, our algorithm considerably improves performance in Roboschool continuous control benchmarks. Our results also show that PPO-CMA, as opposed to PPO, is significantly less sensitive to the choice of hyperparameters, allowing one to use it in complex movement optimization tasks without requiring tedious tuning.
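The contrast between shrinking and expanding the exploration variance can be illustrated with a toy one-dimensional example. This is a minimal sketch under stated assumptions, not the PPO-CMA algorithm itself: it only shows the CMA-ES-style idea of measuring the spread of the selected samples around the old mean instead of the new one. The quadratic objective, the "keep the better half" selection rule, the sample count, and the function name `variance_estimates` are all illustrative choices that do not come from the abstract.

```python
import numpy as np


def variance_estimates(old_mean, old_std, target, n_samples=1000, seed=0):
    """Compare two variance estimates after one selection step (toy example).

    Hypothetical helper: samples from a 1-D Gaussian search distribution,
    keeps the better half under an assumed quadratic objective, and returns
    the spread of that elite set measured around the new and the old mean.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(old_mean, old_std, size=n_samples)

    # Assumed objective: higher is better, with the optimum at `target`.
    fitness = -(samples - target) ** 2

    # Keep the better half of the samples (argsort is ascending in fitness).
    elite = samples[np.argsort(fitness)[n_samples // 2:]]
    new_mean = elite.mean()

    # Spread around the NEW mean: ignores how far the mean itself moved.
    var_around_new_mean = np.mean((elite - new_mean) ** 2)

    # Spread around the OLD mean (CMA-ES style): the mean shift is included,
    # so this estimate is never smaller than the one above.
    var_around_old_mean = np.mean((elite - old_mean) ** 2)

    return new_mean, var_around_new_mean, var_around_old_mean


new_mean, var_new, var_old = variance_estimates(old_mean=0.0, old_std=0.5, target=5.0)
print(f"old variance:                  {0.5 ** 2:.3f}")
print(f"elite variance around new mean: {var_new:.3f}")
print(f"elite variance around old mean: {var_old:.3f}")
```

The two estimates differ by exactly the squared mean shift, so when the optimum lies far from the current mean, the estimate taken around the old mean retains variance along the direction of progress while the estimate taken around the new mean shrinks.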