Optimistic Proximal Policy Optimization
This work addresses a specific challenge in reinforcement learning, namely domains with sparse rewards, but appears incremental because it builds on existing proximal policy optimization methods.
The paper tackles the difficulty of learning good policies in reinforcement learning when rewards are rare by proposing Optimistic Proximal Policy Optimization (OPPO), which optimistically evaluates policies based on uncertainty in estimated total returns, and shows that OPPO outperforms existing methods in a tabular task.
Reinforcement learning, a machine learning framework for training an autonomous agent based on rewards, has shown outstanding results in various domains. However, learning a good policy is known to be difficult in domains where rewards are rare. We propose a method, optimistic proximal policy optimization (OPPO), to alleviate this difficulty. OPPO considers the uncertainty of the estimated total return and evaluates the policy optimistically according to the size of that uncertainty. We show that OPPO outperforms existing methods in a tabular task.
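The idea of evaluating optimistically under uncertainty can be illustrated with a minimal sketch. This is not the OPPO objective itself, which the abstract does not spell out. It assumes a generic "estimate plus uncertainty bonus" score, and the names `optimistic_return`, `rank_actions`, and the coefficient `beta` are hypothetical choices made for illustration.

```python
import math


def optimistic_return(mean_return, return_variance, beta=1.0):
    """Score an option by its estimated return plus an uncertainty bonus.

    Assumption: uncertainty is summarized as a variance of the estimated
    total return, and the bonus is beta times its square root. The paper
    may define both quantities differently.
    """
    if return_variance < 0:
        raise ValueError("return_variance must be non-negative")
    return mean_return + beta * math.sqrt(return_variance)


def rank_actions(estimates, beta=1.0):
    """Return action names sorted by optimistic score, highest first.

    estimates maps an action name to (mean_return, return_variance).
    """
    return sorted(
        estimates,
        key=lambda name: optimistic_return(*estimates[name], beta=beta),
        reverse=True,
    )


# Hypothetical numbers for a sparse-reward setting: one action has a small
# but certain return, the other has seen no reward yet but is still uncertain.
estimates = {
    "well_explored": (0.10, 0.0),
    "rarely_tried": (0.00, 0.04),
}

greedy_order = rank_actions(estimates, beta=0.0)      # ignores uncertainty
optimistic_order = rank_actions(estimates, beta=1.0)  # rewards uncertainty
```

With `beta=0.0` the ranking is purely greedy and prefers the well-explored action. With `beta=1.0` the uncertain action ranks first, which is the kind of exploration pressure an optimistic evaluation is meant to provide when rewards are rare.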