LG · AI · RO · May 25, 2025

Improving Value Estimation Critically Enhances Vanilla Policy Gradient

arXiv:2505.19247v1 · 5 citations · h-index: 5 · ICML
Originality: Incremental advance
AI Analysis

This work aims to make reinforcement learning algorithms more effective and easier for practitioners to use. The advance is incremental, since it builds on existing methods.

The paper challenges the belief that trust regions are key to policy gradient success. It shows that improving value estimation accuracy, by taking more value update steps per iteration, lets vanilla policy gradient match or exceed PPO on continuous control benchmarks while being more robust to hyperparameter choices.

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
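The central idea in the abstract, taking more value update steps per iteration while leaving the policy update as a plain gradient step, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear value function, the Gaussian policy, the synthetic batch, and every name here (`fit_value`, `vpg_policy_step`, `value_steps`, the learning rates) are assumptions made for illustration only.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = r_t + gamma * G_{t+1} for a single trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def fit_value(w, features, returns, value_steps, lr=0.1):
    """Take `value_steps` gradient steps on the MSE of a linear value function.

    `value_steps` is the knob the abstract refers to (hypothetical name).
    """
    for _ in range(value_steps):
        residual = features @ w - returns
        grad = 2.0 * features.T @ residual / len(returns)
        w = w - lr * grad
    return w

def value_mse(w, features, returns):
    """Mean squared error between predicted values and observed returns."""
    return float(np.mean((features @ w - returns) ** 2))

def vpg_policy_step(theta, features, actions, advantages, sigma=0.5, lr=0.01):
    """One vanilla policy gradient step for a ~ N(features @ theta, sigma^2).

    No clipping and no trust region: just a single ascent step on the
    advantage-weighted log-likelihood.
    """
    mean = features @ theta
    # d/dtheta log N(a | mean, sigma^2) = (a - mean) / sigma^2 * features
    score = ((actions - mean) / sigma**2)[:, None] * features
    grad = (score * advantages[:, None]).mean(axis=0)
    return theta + lr * grad

# Synthetic batch standing in for one iteration of collected experience
# (assumed data, not a benchmark environment).
rng = np.random.default_rng(0)
n, d = 200, 3
features = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
rewards = features @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)
returns = discounted_returns(rewards)

w0 = np.zeros(d)
w_few = fit_value(w0, features, returns, value_steps=1)
w_many = fit_value(w0, features, returns, value_steps=50)
print("value MSE after 1 step:  ", value_mse(w_few, features, returns))
print("value MSE after 50 steps:", value_mse(w_many, features, returns))

# Advantages from the better-fit value function feed one plain policy step.
theta = np.zeros(d)
actions = features @ theta + 0.5 * rng.normal(size=n)
advantages = returns - features @ w_many
theta = vpg_policy_step(theta, features, actions, advantages)
```

In this toy setting the only difference between the two value fits is the number of gradient steps, so the lower error after 50 steps directly reflects the extra value updates; whether that carries over to a given benchmark is what the paper itself evaluates.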

Code Implementations: 1 repo