LG · May 20, 2022

The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure

Xing Chen, Dongcui Diao, Hechang Chen, Hengshuai Yao, Haiyin Piao, Zhixiao Sun, Zhiwei Yang, Randy Goebel, Bei Jiang, Yi Chang

arXiv:2205.10047v614.128 citations · h-index: 30 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a limitation of policy gradient methods for reinforcement learning: PPO optimizes only within a clipped policy space. It offers an incremental improvement over PPO by exploring further off-policy and optimizing the surrogate objective more effectively.

The paper addresses a limitation of Proximal Policy Optimization (PPO) by showing that better policies exist outside its clipped policy space. It introduces a surrogate objective based on the sigmoid function, which explores a larger policy space and maximizes the Conservative Policy Iteration (CPI) objective more effectively than PPO.

The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do better policies exist outside of this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploration), we found that the answer is "YES", and the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in "off-policyness", according to an off-policy metric called DEON. Our algorithm explores in a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all current PPO methods use the clipping operation and optimize in the clipped policy space. Our method is the first of its kind, and it advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.
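To make the contrast between hard clipping and a sigmoid-based surrogate concrete, here is a minimal sketch in plain Python. The PPO clipped surrogate is the standard one. The sigmoid form, the temperature `tau`, and the helper `max_ratio_deviation` are illustrative assumptions: the abstract does not give the exact objective or the definition of DEON, so this should not be read as the authors' implementation (see the linked repository for that).

```python
import math

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO per-sample surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def soft_clip_surrogate(ratio, advantage, tau=4.0):
    """Assumed sigmoid-based surrogate (illustrative, not taken from the abstract).

    The 4/tau factor makes the slope at ratio == 1 equal to the advantage,
    matching the unclipped CPI term r * A at the on-policy point.
    """
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return s * (4.0 / tau) * advantage

def max_ratio_deviation(ratios):
    """Hypothetical stand-in for an off-policyness measure: max |r - 1|."""
    return max(abs(r - 1.0) for r in ratios)

def slope(fn, ratio, advantage, h=1e-5):
    """Central finite difference of a surrogate with respect to the ratio."""
    return (fn(ratio + h, advantage) - fn(ratio - h, advantage)) / (2.0 * h)

# Probability ratios pi_new / pi_old, with a positive advantage of 1.0.
ratios = [0.5, 1.0, 1.5, 3.0]
ppo_slopes = [slope(ppo_clip_surrogate, r, 1.0) for r in ratios]
soft_slopes = [slope(soft_clip_surrogate, r, 1.0) for r in ratios]

for r, p, s in zip(ratios, ppo_slopes, soft_slopes):
    print(f"ratio={r:.1f}  ppo_slope={p:.4f}  soft_slope={s:.4f}")
print("max |ratio - 1| =", max_ratio_deviation(ratios))
```

With a positive advantage, the PPO surrogate is flat once the ratio exceeds 1 + eps, so samples there contribute no gradient. The sigmoid-based sketch keeps a small but nonzero slope at every ratio, which is one way a policy could keep moving away from the clipped region.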
