LG · AI · ML · Jan 10, 2013

The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

arXiv:1301.2315v1 · 273 citations
Originality: Incremental advance
AI Analysis

The paper addresses a practical bottleneck in gradient-based reinforcement learning: the high variance of policy-gradient estimates. The advance is incremental, since it builds on existing gradient-based approaches.

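The paper tackles the high variance of gradient estimates in gradient-based reinforcement learning by incorporating a reward baseline, which changes the variance without introducing further bias. It shows that, as the estimator approaches its zero-bias, high-variance parameterization, the variance-minimizing constant baseline equals the long-term average expected reward. Experiments with the modified policy-gradient algorithms show improvement over previous work.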

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
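The claim that a constant baseline changes the variance but not the expected value of the gradient estimate can be checked on a toy problem. The sketch below is not the paper's algorithm or its experimental setting: it assumes a one-step, two-action problem with a sigmoid policy, and the names estimator_moments and expected_reward, the reward values, and the parameter theta are all hypothetical. It enumerates both actions exactly, so no sampling is involved.

```python
import math


def estimator_moments(theta, rewards, baseline):
    """Exact mean and variance of the one-sample gradient estimate
    d/dtheta log pi(a) * (r(a) - baseline)
    for a two-action policy with pi(a=1) = sigmoid(theta).
    (Toy setting assumed for illustration, not taken from the paper.)"""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    probs = [1.0 - p1, p1]
    # Score function d/dtheta log pi(a): -p1 for action 0, 1 - p1 for action 1.
    scores = [-p1, 1.0 - p1]
    samples = [s * (r - baseline) for s, r in zip(scores, rewards)]
    mean = sum(p * x for p, x in zip(probs, samples))
    second_moment = sum(p * x * x for p, x in zip(probs, samples))
    return mean, second_moment - mean * mean


def expected_reward(theta, rewards):
    """Average reward under the same sigmoid policy."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p1) * rewards[0] + p1 * rewards[1]


theta = 0.5            # hypothetical policy parameter
rewards = [1.0, 3.0]   # hypothetical rewards for actions 0 and 1

avg = expected_reward(theta, rewards)
mean_zero, var_zero = estimator_moments(theta, rewards, 0.0)
mean_avg, var_avg = estimator_moments(theta, rewards, avg)

print("no baseline:        mean=%.6f var=%.6f" % (mean_zero, var_zero))
print("avg-reward baseline: mean=%.6f var=%.6f" % (mean_avg, var_avg))
```

In this toy the two means agree and the average-reward baseline gives a lower variance than no baseline. The one-step problem does not reproduce the paper's optimality result, which concerns the long-run, partially observable setting as the estimator approaches its zero-bias parameterization; here the average reward is generally not the exact variance minimizer.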

