LG AI | Feb 4, 2022

A Temporal-Difference Approach to Policy Gradient Estimation

Samuele Tosatto, Andrew Patterson, Martha White, A. Rupam Mahmood

arXiv:2202.02396v44.62 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a key issue for reinforcement learning practitioners by providing a model-free policy gradient estimator that avoids distribution shift, though the advance appears incremental because it builds on existing policy gradient theory.

The paper tackles the distribution shift problem in policy gradient estimation. It proposes reconstructing the gradient from the start state using a gradient critic learned with temporal-difference updates. The authors prove that the estimator is unbiased under certain realizability conditions, and they show empirically that it achieves a better bias-variance trade-off and better performance with off-policy samples.
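The recursion behind a gradient critic can be sketched by differentiating the standard Bellman equation for the action-value function. The notation below (Γ for the gradient critic, μ₀ for the start-state distribution, a reward that does not depend on θ) is an assumption made for illustration and may differ from the paper's.

```latex
\begin{align}
Q^\pi(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\; a' \sim \pi_\theta(\cdot \mid s')}
  \big[ Q^\pi(s',a') \big] \\
\Gamma^\pi(s,a) := \nabla_\theta Q^\pi(s,a)
  &= \gamma\, \mathbb{E}_{s',a'}
  \big[ Q^\pi(s',a')\, \nabla_\theta \log \pi_\theta(a' \mid s') + \Gamma^\pi(s',a') \big] \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{s_0 \sim \mu_0,\; a_0 \sim \pi_\theta(\cdot \mid s_0)}
  \big[ Q^\pi(s_0,a_0)\, \nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \Gamma^\pi(s_0,a_0) \big]
\end{align}
```

The second line has the same shape as a Bellman equation, with the term Q∇log π playing the role of a vector-valued reward, which is why Γ can be learned with temporal-difference updates. The third line uses only the start-state distribution, so no discounted state distribution under the target policy has to be sampled.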

The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
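The idea of a gradient critic can be illustrated on a small tabular problem. The following is a minimal sketch, not the authors' implementation: the random 3-state MDP, the softmax policy, and all function names are hypothetical, the action values Q are computed exactly rather than learned, and the temporal-difference target uses the expected backup (a model-free variant would replace it with a single sampled transition). The critic Γ(s,a) approximates the gradient of Q(s,a) with respect to the policy parameters, and the policy gradient is then read off at the start state only.

```python
import numpy as np

def softmax_policy(theta):
    # theta has shape (nS, nA); returns pi[s, a]
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_log_pi(pi):
    # D[s, a, :] = d log pi(a|s) / d theta, with theta flattened to length nS*nA
    nS, nA = pi.shape
    D = np.zeros((nS, nA, nS, nA))
    for s in range(nS):
        for a in range(nA):
            D[s, a, s, :] = -pi[s]
            D[s, a, s, a] += 1.0
    return D.reshape(nS, nA, nS * nA)

def q_and_transition(P, R, pi, gamma):
    # P_pi[(s,a),(s',a')] = P[s,a,s'] * pi[s',a']; Q solves the Bellman equation
    nS, nA = R.shape
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)
    q = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, R.reshape(-1))
    return q.reshape(nS, nA), P_pi

def expected_return(theta, P, R, mu0, gamma):
    pi = softmax_policy(theta)
    Q, _ = q_and_transition(P, R, pi, gamma)
    return float(mu0 @ (pi * Q).sum(axis=1))

def gradient_critic_td(G, P_pi, gamma, alpha=0.5, sweeps=2000):
    # Expected TD(0) sweeps on Gamma, where G[(s,a)] = Q(s,a) * grad log pi(a|s)
    # acts as a vector-valued "reward" received at the next state-action pair.
    Gamma = np.zeros_like(G)
    for _ in range(sweeps):
        target = gamma * P_pi @ (G + Gamma)
        Gamma += alpha * (target - Gamma)
    return Gamma

# Hypothetical toy problem (assumption: not taken from the paper).
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
theta = rng.normal(size=(nS, nA))
mu0 = np.array([1.0, 0.0, 0.0])  # episodes always start in state 0

pi = softmax_policy(theta)
Q, P_pi = q_and_transition(P, R, pi, gamma)
G = (Q[:, :, None] * grad_log_pi(pi)).reshape(nS * nA, -1)
Gamma = gradient_critic_td(G, P_pi, gamma)

# Policy gradient reconstructed from the start state only.
w = (mu0[:, None] * pi).reshape(-1)
grad = w @ (G + Gamma)

# Central finite differences of the expected return, as a sanity check.
eps = 1e-5
fd_grad = np.zeros(nS * nA)
for i in range(nS * nA):
    d = np.zeros(nS * nA)
    d[i] = eps
    d = d.reshape(nS, nA)
    fd_grad[i] = (expected_return(theta + d, P, R, mu0, gamma)
                  - expected_return(theta - d, P, R, mu0, gamma)) / (2 * eps)
```

In this sketch the critic is updated for every state-action pair regardless of how often the target policy would visit it, which is the sense in which the start-state form does not depend on a particular sampling distribution. Comparing `grad` with `fd_grad` gives a quick check that the recursion recovers the gradient of the expected return.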
