LG AI · Dec 26, 2020

POPO: Pessimistic Offline Policy Optimization

arXiv:2012.13682v27.210 citations · Has Code

Originality: Incremental advance

AI Analysis

This work is relevant to researchers and practitioners in reinforcement learning: it proposes an algorithm for learning effective policies from static, pre-recorded datasets without environment interaction, and it represents an incremental improvement over existing methods.

This paper addresses the failure of off-policy reinforcement learning methods in offline settings by proposing Pessimistic Offline Policy Optimization (POPO). POPO learns a pessimistic value function, which enables it to perform surprisingly well and scale to high-dimensional state and action spaces, outperforming or matching several state-of-the-art offline RL algorithms on benchmark tasks.

Offline reinforcement learning (RL), also known as batch RL, aims to optimize a policy from a large pre-recorded dataset without interaction with the environment. This setting offers the promise of utilizing diverse, pre-collected datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy algorithms based on Q-learning or actor-critic perform poorly when learning from a static dataset. In this work, we study why off-policy RL methods fail to learn in the offline setting from the value function view, and we propose a novel offline RL algorithm that we call Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming several state-of-the-art offline RL algorithms on benchmark tasks.
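
The abstract does not spell out how POPO constructs its pessimistic value function, so the following is only a generic illustration of what "pessimistic" can mean for a value target, not the authors' method. It assumes several independent estimates of the next state's value are available (the function name `pessimistic_td_target` and its parameters are hypothetical) and bootstraps from the lowest one, so that an overestimate on actions poorly covered by the dataset is less likely to propagate.

```python
from typing import Sequence


def pessimistic_td_target(
    reward: float,
    done: bool,
    next_q_estimates: Sequence[float],
    gamma: float = 0.99,
) -> float:
    """Bootstrapped target built from the lowest next-state value estimate.

    Illustrative assumption: `next_q_estimates` holds the outputs of several
    independently trained critics for the same next state and action.
    """
    if not next_q_estimates:
        raise ValueError("need at least one next-state value estimate")
    # Pessimism: trust the most conservative estimate of the next state's value.
    pessimistic_next_q = min(next_q_estimates)
    # Do not bootstrap past a terminal transition.
    return reward + (0.0 if done else gamma * pessimistic_next_q)


# Two hypothetical critics disagree about the next state; the lower value is used.
target = pessimistic_td_target(
    reward=1.0, done=False, next_q_estimates=[10.0, 4.0], gamma=0.5
)
print(target)  # 1.0 + 0.5 * 4.0 = 3.0
```

Taking the minimum is one simple way to bias value estimates downward; other forms of pessimism, such as penalizing values by an uncertainty estimate, follow the same idea of preferring underestimation to overestimation when data coverage is limited.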
