LG · ML · May 28, 2025

Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

IBM
arXiv:2505.22257v2 · 27 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This work addresses training efficiency and stability in reinforcement learning. The advance is incremental, since it builds on existing GRPO and off-policy PPO methods.

The paper revisits Group Relative Policy Optimization (GRPO) by adapting it to off-policy training. It shows that both the on-policy and off-policy objectives yield an improvement in the reward, and that in empirical tests off-policy GRPO either significantly outperforms or matches on-policy GRPO.

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
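
The two ingredients the abstract names, a group-relative advantage estimate and a clipped surrogate objective, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the sequence-level log-probabilities, the clip range of 0.2, and the omission of any KL regularizer are all assumptions made for illustration. The only difference between the two regimes in this sketch is which policy produced the samples and therefore supplies `logp_behavior`: the current policy iterate (on-policy) or an older one (off-policy).

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper: each reward is centered by the group mean and
    scaled by the group (population) standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_behavior, advantages, clip=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized).

    logp_new:      log-probabilities under the policy being optimized
    logp_behavior: log-probabilities under the policy that generated the
                   samples (assumed to be an older policy when off-policy)
    """
    total = 0.0
    for ln, lb, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(ln - lb)  # importance ratio pi_new / pi_behavior
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Example: four responses to one prompt with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
objective = clipped_surrogate(
    logp_new=[-1.0, -2.0, -1.5, -0.5],
    logp_behavior=[-1.2, -1.8, -1.5, -0.9],
    advantages=advantages,
)
```

When the behavior policy drifts far from the policy being optimized, the importance ratio leaves the interval `[1 - clip, 1 + clip]` and the clipped term caps how much any single sample can raise the objective, which is the usual rationale for clipping when samples are reused.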

