LG · AI · ML · Mar 27, 2019

Generalized Off-Policy Actor-Critic

arXiv:1903.11329v8 · 45 citations
Originality: Highly original
AI Analysis

This work addresses performance prediction issues in off-policy RL for robotics and simulation tasks, representing a novel method rather than an incremental improvement.

The authors address misleading performance predictions in off-policy policy gradient algorithms for continuing RL. They propose a new counterfactual objective that better predicts the target policy's performance, and they report the first empirical success of emphatic algorithms on deep RL benchmarks such as MuJoCo simulations.

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
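
The abstract names the objectives without writing them out. The block below is a sketch of how they are commonly stated, not a quotation from the paper. The notation is an assumption: $d_\mu$ and $d_\pi$ are the stationary state distributions of the behavior policy $\mu$ and the target policy $\pi$, $i(s)$ is an interest function, $v_\pi$ is the value function of $\pi$, $P_\pi$ is the state transition matrix under $\pi$, and $\hat{\gamma} \in [0, 1]$ is an interpolation parameter. The paper itself may use different symbols, for example a separate interest function for the counterfactual objective.

```latex
% Excursion objective: states weighted by the behavior policy's distribution
J_\mu(\pi) = \sum_s d_\mu(s)\, i(s)\, v_\pi(s)

% Alternative-life objective: states weighted by the target policy's distribution
J_\pi(\pi) = \sum_s d_\pi(s)\, i(s)\, v_\pi(s)

% Counterfactual objective: states weighted by an interpolated distribution
J_{\hat{\gamma}}(\pi) = \sum_s d_{\hat{\gamma}}(s)\, i(s)\, v_\pi(s),
\qquad
d_{\hat{\gamma}} =
\begin{cases}
(1 - \hat{\gamma})\,\bigl(I - \hat{\gamma} P_\pi^{\top}\bigr)^{-1} d_\mu, & \hat{\gamma} \in [0, 1), \\
d_\pi, & \hat{\gamma} = 1.
\end{cases}
```

Under these assumptions, $\hat{\gamma} = 0$ gives $d_{\hat{\gamma}} = d_\mu$, which recovers the excursion objective, and $\hat{\gamma} = 1$ gives the objective weighted by $d_\pi$. This is one way to read the abstract's claim that the counterfactual objective unifies existing objectives.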

Code Implementations: 1 repo