ML AI Jan 10, 2018

Expected Policy Gradients for Reinforcement Learning

arXiv:1801.03326v2 19.361 citations

Originality: Incremental advance

AI Analysis

This work targets the high variance of policy gradient estimates, a key bottleneck in reinforcement learning, with a more general policy gradient method. It is rated incremental because it builds on and unifies existing stochastic and deterministic approaches.

The paper tackles the high variance of policy gradient methods in reinforcement learning by proposing expected policy gradients (EPG), which unify the stochastic and deterministic approaches and integrate (or sum) across actions when estimating the gradient. The authors prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies, and their experiments show that it outperforms existing methods on multiple challenging control domains.

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to the matrix exponential of the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
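To make the "sums across actions" idea concrete, here is a minimal sketch for a single state with a softmax policy over discrete actions. It is an illustration under stated assumptions, not the paper's implementation: the function names `spg_estimate` and `epg_estimate` are hypothetical, the critic values `q_values` are taken as given, and only the inner per-state gradient is shown (no trajectory sampling or critic learning). The sampled estimator uses only the action that was drawn, while the expected estimator sums over every action, so it returns the same value on every call.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def spg_estimate(logits, q_values, rng):
    """Sampled (score-function) estimate for one state.

    Draws a single action a ~ pi and returns grad_logits log pi(a) * Q(a).
    """
    pi = softmax(logits)
    a = rng.choice(len(pi), p=pi)
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0  # d log pi(a) / d logits = onehot(a) - pi
    return grad_log_pi * q_values[a]


def epg_estimate(logits, q_values):
    """Expected estimate for one state: sum_a grad_logits pi(a) * Q(a).

    For a softmax policy the sum has the closed form
    pi_j * (Q_j - sum_a pi_a Q_a), so no action sampling is needed.
    """
    pi = softmax(logits)
    return pi * (q_values - pi @ q_values)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.array([0.5, -1.0, 2.0])   # illustrative policy parameters
    q_values = np.array([1.0, 3.0, -2.0])  # illustrative critic outputs

    samples = np.stack(
        [spg_estimate(logits, q_values, rng) for _ in range(5000)]
    )
    print("EPG (exact sum over actions):", epg_estimate(logits, q_values))
    print("SPG mean over 5000 samples:  ", samples.mean(axis=0))
    print("SPG per-component variance:  ", samples.var(axis=0))
```

Both estimators have the same expectation for a fixed state; the difference is that the summed version removes the variance contributed by sampling the action. The continuous-action case described in the abstract replaces the sum with an integral, which this sketch does not cover.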
