LGAIJun 21, 2021

OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

arXiv:2106.10783v1141 citations
Originality Incremental advance
AI Analysis

This addresses offline RL for agents that must learn from fixed datasets without environment interaction, but it is incremental as it builds on prior methods to mitigate overestimation more principledly.

The paper tackles the problem of distributional shift in offline reinforcement learning, which causes overestimation of action values, by introducing OptiDICE, an algorithm that directly estimates stationary distribution corrections of the optimal policy without policy-gradients, and it performs competitively with state-of-the-art methods on benchmark datasets.

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes