LG | ML | Dec 11, 2019

Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning

arXiv:1912.05109v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses stability issues in reinforcement learning algorithms, particularly for continuous control and safe RL applications, though it represents an incremental improvement by extending existing doubly robust estimators to actor-critic methods.

The paper tackles the problem of high variance and instability in off-policy actor-critic algorithms by introducing a doubly robust estimator for critic evaluation, which significantly improves performance in continuous control tasks and enhances robustness under stochastic or corrupted reward signals.

We study the problem of off-policy critic evaluation in several variants of value-based off-policy actor-critic algorithms. Off-policy actor-critic algorithms require an off-policy critic evaluation step to estimate the value of the new policy after every policy gradient update. Despite the enormous success of off-policy policy gradients on control tasks, existing general methods suffer from high variance and instability, partly because the policy improvement depends on the gradient of the estimated value function. In this work, we present a new approach to off-policy policy evaluation in actor-critic, based on doubly robust estimators. We extend the doubly robust estimator from off-policy policy evaluation (OPE) to actor-critic algorithms that consist of a reward estimator performance model. We find that doubly robust estimation of the critic can significantly improve performance in continuous control tasks. Furthermore, in cases where a stochastic reward function leads to high variance, doubly robust critic estimation can improve performance under corrupted, stochastic reward signals, indicating its usefulness for robust and safe reinforcement learning.
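
For readers unfamiliar with the doubly robust estimator that the abstract builds on, the following is a minimal sketch of the standard step-wise doubly robust recursion from the OPE literature. It is not the paper's critic update, which the abstract does not spell out. The names `doubly_robust_return`, `q_hat`, `v_hat`, and the `Step` tuple layout are illustrative assumptions, and the importance ratios are assumed to be precomputed from the target and behaviour policies.

```python
from typing import Callable, Sequence, Tuple

# One logged transition (hypothetical layout):
# (state, action, reward, importance ratio pi(a|s) / mu(a|s))
Step = Tuple[float, int, float, float]


def doubly_robust_return(
    trajectory: Sequence[Step],
    q_hat: Callable[[float, int], float],
    v_hat: Callable[[float], float],
    gamma: float = 0.99,
) -> float:
    """Step-wise doubly robust value estimate for one logged trajectory.

    Applies the backward recursion
        V_DR <- v_hat(s) + rho * (r + gamma * V_DR - q_hat(s, a)),
    starting from V_DR = 0 after the final step.
    """
    v_dr = 0.0  # value beyond the end of the episode
    for state, action, reward, rho in reversed(trajectory):
        # Model-based baseline plus an importance-weighted correction.
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr
```

With `q_hat` and `v_hat` set to zero the recursion reduces to a per-step importance-weighted return, and with all ratios equal to zero it falls back to the model estimate `v_hat` at the first state. The estimate stays reliable if either the model or the importance weights are accurate, which is the "doubly robust" property.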

