LG AI ML | Jul 2, 2018

Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization

arXiv:1807.00442v4 | 9 citations | Has Code

Originality: Incremental advance

AI Analysis

This is an incremental improvement for reinforcement learning practitioners, offering an alternative to PPO with potential efficiency gains.

The paper introduces POP3D, a first-order gradient reinforcement learning algorithm that addresses shortcomings in existing methods like PPO, achieving competitive performance on common benchmarks.

As the most successful variant and improvement of Trust Region Policy Optimization (TRPO), proximal policy optimization (PPO) has been widely applied across various domains thanks to several advantages: efficient data utilization, easy implementation, and good parallelism. In this paper, we propose another powerful variant: a first-order gradient reinforcement learning algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty term, the point probability distance, is a lower bound on the square of the total variation divergence. First, we discuss the shortcomings of several commonly used algorithms, which partly motivate our method. Second, we show how POP3D overcomes these shortcomings. Third, we examine its mechanism from the perspective of the solution manifold. Finally, we make quantitative comparisons among several state-of-the-art algorithms on common benchmarks. Simulation results show that POP3D is highly competitive with PPO. Our code is released at https://github.com/paperwithcode/pop3d.
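The lower-bound claim in the abstract can be checked numerically for discrete action spaces. The sketch below is illustrative only and is not taken from the released code. It assumes the point probability distance is the squared difference between the old and new probabilities of a single sampled action, and that the penalized objective is a probability-ratio surrogate minus a weighted penalty. The function names and the coefficient `beta` are hypothetical.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def point_probability_distance(p, q, action):
    """Assumed form: squared gap between the probabilities of one action."""
    return (p[action] - q[action]) ** 2


def penalized_surrogate(p_old, p_new, action, advantage, beta):
    """Assumed objective: ratio * advantage minus beta times the penalty.

    beta is a hypothetical penalty coefficient chosen for illustration.
    """
    ratio = p_new[action] / p_old[action]
    penalty = point_probability_distance(p_old, p_new, action)
    return ratio * advantage - beta * penalty


def random_distribution(n, rng):
    """Draw a random probability vector of length n with no zero entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def bound_holds(trials=1000, n_actions=4, seed=0):
    """Check penalty <= TV^2 for every action over random policy pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(n_actions, rng)
        q = random_distribution(n_actions, rng)
        tv_squared = total_variation(p, q) ** 2
        for action in range(n_actions):
            # Small tolerance absorbs floating-point rounding.
            if point_probability_distance(p, q, action) > tv_squared + 1e-12:
                return False
    return True
```

Under these assumptions the bound follows because the probability differences sum to zero, so the absolute change on any one action cannot exceed the total variation distance. With two actions the bound is tight: for `[0.5, 0.5]` and `[0.25, 0.75]` both quantities equal 0.0625.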

