ML LG Mar 6, 2017

Revisiting stochastic off-policy action-value gradients

arXiv:1703.02102v2

Originality: Synthesis-oriented

AI Analysis

This work presents an incremental improvement to reinforcement learning algorithms, focusing on off-policy stochastic methods. It is aimed at researchers and practitioners.

The paper addresses how to derive optimal policies in off-policy stochastic actor-critic methods by approximating the action-value gradient rather than the stochastic policy gradient. With action-value gradients, policy improvement occurs along the direction of steepest ascent. The paper also discusses an incremental approach that follows the policy gradient in lieu of the natural gradient.

Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable as policy improvement occurs along the direction of steepest ascent. This has been studied extensively within the context of natural gradient actor-critic algorithms and more recently within the context of deterministic policy gradients. In this paper we briefly discuss the off-policy stochastic counterpart to deterministic action-value gradients, as well as an incremental approach for following the policy gradient in lieu of the natural gradient.
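
For orientation, the two gradient forms the abstract contrasts can be written in their standard textbook versions, as sketched below. The notation is an assumption made here and is not taken from the paper: \rho denotes the discounted state distribution, Q the action-value function, \pi_\theta a stochastic policy, and \mu_\theta a deterministic policy. The paper's own off-policy stochastic action-value gradient is not reproduced.

```latex
% Stochastic policy gradient (score-function form), policy \pi_\theta(a \mid s):
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]

% Deterministic policy gradient (action-value gradient form), policy a = \mu_\theta(s):
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}
    \left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]
```

The first form weights the score \nabla_\theta \log \pi_\theta by the value of the sampled action, while the second uses the gradient of Q with respect to the action itself, which is the quantity the abstract calls the action-value gradient.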
