LG, CL | May 29, 2025

On-Policy RL with Optimal Reward Baseline

arXiv:2505.23585v2 | 29 citations | h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses stability and efficiency issues in reinforcement learning for aligning large language models with human preferences, representing an incremental improvement over existing methods.

The paper tackles training instability and computational inefficiency in reinforcement learning for large language models by proposing the OPO algorithm, which the authors report achieves superior performance and training stability on mathematical reasoning benchmarks without additional models or regularization terms.

Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.
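
The abstract does not spell out the baseline formula, so the following is only a minimal sketch under a stated assumption: if the squared gradient norm of a sampled response is roughly proportional to its token length, the variance-minimizing baseline for one prompt reduces to a length-weighted average of the rewards of the responses sampled for that prompt. The function names `length_weighted_baseline` and `baseline_advantages` are hypothetical and are not taken from the verl implementation.

```python
def length_weighted_baseline(rewards, lengths):
    """Length-weighted mean reward over responses sampled for one prompt.

    Assumption (not stated in the abstract): response length stands in
    for the squared gradient norm in the variance-minimizing baseline.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same size")
    total_length = sum(lengths)
    if total_length == 0:
        raise ValueError("at least one response must be non-empty")
    weighted_sum = sum(l * r for l, r in zip(lengths, rewards))
    return weighted_sum / total_length


def baseline_advantages(rewards, lengths):
    """Advantage of each response: its reward minus the shared baseline."""
    b = length_weighted_baseline(rewards, lengths)
    return [r - b for r in rewards]


# Four hypothetical responses to one prompt: binary rewards and token counts.
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [100, 300, 200, 400]

baseline = length_weighted_baseline(rewards, lengths)
advantages = baseline_advantages(rewards, lengths)
```

In this toy group the correct responses are shorter than the incorrect ones, so the length-weighted baseline (0.3) sits below the plain mean reward (0.5). With exact on-policy sampling, these advantages would multiply the log-probability gradients directly, with no importance ratio or clipping term.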

Code Implementations: 1 repo
