AI · Feb 21, 2018

Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

arXiv:1802.07842v1 · 46 citations
Originality: Highly original
AI Analysis

This paper addresses a foundational problem in reinforcement learning for researchers and practitioners who work with high-dimensional action spaces, offering a novel way to make off-policy learning stable and efficient.

The paper tackles off-policy training in reinforcement learning with continuous or large action sets, where estimating state-action value functions is infeasible. It introduces convergent Actor-Critic algorithms that rely on state-value functions instead, which lifts the curse of dimensionality that the action representation imposes. The resulting algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, form the first class of policy-gradient methods guaranteed to converge under off-policy training with function approximation, and they keep the desirable properties of classical Actor-Critic methods without additional hyper-parameters.

We present the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the curse of dimensionality, that is, problems with continuous or large action sets, where estimating state-action value functions (Q functions) becomes infeasible. Using state-value functions helps to lift the curse and, as a result, naturally turns our policy-gradient solution into a classical Actor-Critic architecture whose Actor uses the state-value function for its update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived from the exact gradient of the averaged state-value function objective and are thus guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.
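To make the architecture described in the abstract concrete, here is a minimal sketch of a generic one-step off-policy Actor-Critic update whose Actor is driven by a state-value TD error rather than a Q function. It is not the paper's Gradient Actor-Critic or Emphatic Actor-Critic: the gradient-TD correction and emphatic weighting that provide the convergence guarantee are omitted, and a plain importance-weighted TD(0) critic like this one is not guaranteed to converge off-policy with function approximation. The linear-softmax policy, the function names, the step sizes `alpha_w` and `alpha_theta`, and the behaviour-policy probability `b_prob` are all assumptions made for illustration.

```python
import numpy as np


def softmax_policy(theta, x):
    """Action probabilities for state features x under a linear-softmax policy."""
    prefs = theta @ x            # one preference per action
    prefs = prefs - prefs.max()  # subtract the max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()


def off_policy_ac_step(theta, w, x, a, r, x_next, b_prob,
                       gamma=0.99, alpha_w=0.1, alpha_theta=0.01):
    """One illustrative off-policy Actor-Critic update (hypothetical helper).

    theta  : (num_actions, num_features) policy parameters
    w      : (num_features,) linear state-value weights, v(s) ~ w @ x
    a      : index of the action taken by the behaviour policy
    b_prob : probability the behaviour policy assigned to action a
    """
    pi = softmax_policy(theta, x)
    rho = pi[a] / b_prob                      # importance-sampling ratio
    delta = r + gamma * (w @ x_next) - w @ x  # TD error from state values only

    # Critic: importance-weighted TD(0). This is a simplification; the paper's
    # critics use gradient-TD or emphatic corrections instead.
    w_new = w + alpha_w * rho * delta * x

    # Actor: gradient of log pi(a|x) for a linear-softmax policy is
    # (onehot(a) - pi) outer x.
    grad_log = -np.outer(pi, x)
    grad_log[a] += x
    theta_new = theta + alpha_theta * rho * delta * grad_log
    return theta_new, w_new, delta, rho
```

Because the Actor only needs the scalar TD error `delta`, nothing in the update enumerates or represents values per action, which is the property the abstract relies on for large or continuous action sets.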

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
