A K-fold Method for Baseline Estimation in Policy Gradient Algorithms
This work addresses a specific issue in reinforcement learning for policy gradient methods, but it appears incremental as it builds on existing baseline techniques.
The paper tackles underfitting and overfitting in baseline estimation for policy gradient algorithms by developing a K-fold method in which the hyperparameter K adjusts the bias-variance trade-off, and it demonstrates the method on three MuJoCo locomotion control tasks with two state-of-the-art algorithms.
The high variance of unbiased policy-gradient methods such as VPG and REINFORCE is typically mitigated by subtracting a baseline. However, fitting the baseline can itself suffer from underfitting or overfitting. In this paper, we develop a K-fold method for baseline estimation in policy gradient algorithms. The hyperparameter K adjusts the bias-variance trade-off in the baseline estimates. We demonstrate the usefulness of our approach with two state-of-the-art policy gradient algorithms on three MuJoCo locomotion control tasks.
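The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of K-fold baseline estimation, not the paper's actual algorithm. It assumes that each row of `features` describes one sample (for example a state), that `returns` holds the matching observed returns, and that the baseline is a ridge-regularized linear fit. The function name `kfold_baseline`, the `ridge` parameter, and the synthetic data are all hypothetical. Each fold's baseline is predicted by a model fitted only on the other folds, so no sample is scored by a model that saw its own return.

```python
import numpy as np


def kfold_baseline(features, returns, k, ridge=1e-3, seed=0):
    """Return out-of-fold baseline predictions (hypothetical helper).

    Assumption: the baseline is linear in `features` and is fitted by
    ridge-regularized least squares on the K-1 folds not being predicted.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = features.shape[0]
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices and split them into k roughly equal folds.
    folds = np.array_split(rng.permutation(n), k)
    baseline = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        X, y = features[train], returns[train]
        # Solve (X^T X + ridge * I) w = X^T y on the training folds only.
        w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
        # Predict the baseline for the held-out fold.
        baseline[held_out] = features[held_out] @ w
    return baseline


# Synthetic illustration: returns that are nearly linear in the features.
rng = np.random.default_rng(1)
features = np.hstack([rng.normal(size=(200, 3)), np.ones((200, 1))])
true_w = np.array([1.0, -2.0, 0.5, 3.0])
returns = features @ true_w + rng.normal(scale=0.1, size=200)

baseline = kfold_baseline(features, returns, k=5)
advantages = returns - baseline  # baseline-subtracted signal for the gradient
```

In this sketch a small K fits each model on less data, while a large K fits each model on nearly all of it, which is one way the choice of K could shift the bias-variance trade-off in the baseline estimates.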