LG | AI | Oct 9, 2019

Investigation on the generalization of the Sampled Policy Gradient algorithm

arXiv:1910.03728v1
Originality: Synthesis-oriented
AI Analysis

This is an incremental study evaluating a new offline actor-critic variant for reinforcement learning tasks.

The paper compared the Sampled Policy Gradient (SPG) algorithm to CACLA and DPG across environments and network architectures, finding that SPG often did not perform worst but did not consistently match the best algorithm.

The Sampled Policy Gradient (SPG) algorithm is a new offline actor-critic variant that samples in the action space to approximate the policy gradient, using the critic to evaluate the sampled actions. SPG offers theoretical promise over similar algorithms such as DPG because it searches the action-Q-value space independently of the local gradient, which enables it to avoid local optima. This paper compares SPG to two similar actor-critic algorithms, CACLA and DPG. The comparison is made across two different environments and two different network architectures, and between training on on-policy transitions and training from an experience buffer. The results suggest that although SPG is often not the worst-performing algorithm, it does not always match the best-performing algorithm on a given task. Further experiments are required to get a better estimate of the qualities of SPG.
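The sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the Gaussian perturbation, the sample count, the `toy_critic` function, and the name `spg_target_action` are all assumptions made for illustration. The sketch only shows how a critic might be used to score sampled actions and pick a better one than the actor's current output, without following the local gradient.

```python
import numpy as np


def toy_critic(state, action):
    """Hypothetical stand-in for a learned Q(s, a); the state is ignored.

    The surface has a local peak near a = -1 (height 0.5) and a higher
    peak near a = 2 (height 1.0), so a purely local search that starts
    at a = -1 would stay there.
    """
    a = np.asarray(action, dtype=float)[..., 0]
    return 0.5 * np.exp(-(a + 1.0) ** 2) + 1.0 * np.exp(-(a - 2.0) ** 2)


def spg_target_action(state, actor_action, critic, n_samples=32, sigma=1.0, rng=None):
    """Sample actions around the actor's output and score them with the critic.

    Returns (target_action, improved). `improved` is True only when some
    sampled action has a strictly higher critic value than the actor's action.
    Assumption: candidates are drawn by adding Gaussian noise of scale `sigma`.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = np.asarray(actor_action, dtype=float)
    base_q = critic(state, base)

    # Draw candidate actions around the actor's current output.
    noise = rng.standard_normal((n_samples, base.shape[0]))
    candidates = base + sigma * noise

    # Let the critic evaluate every candidate and keep the best one.
    q_values = critic(state, candidates)
    best = int(np.argmax(q_values))
    if q_values[best] > base_q:
        return candidates[best], True
    return base, False


if __name__ == "__main__":
    actor_action = np.array([-1.0])  # actor currently sits on the local peak
    target, improved = spg_target_action(
        None, actor_action, toy_critic, rng=np.random.default_rng(0)
    )
    print("target action:", target, "improved:", improved)
```

In a full actor-critic loop, the actor would presumably be trained toward `target` only when `improved` is true, for example with a regression loss; that training step is left out here because the abstract does not specify it.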

