LG · AI · Sep 25, 2025

GRPO is Secretly a Process Reward Model

arXiv:2509.21154v2 · 4 citations · h-index: 1
Originality: Highly original
AI Analysis

This work addresses inefficiencies in reinforcement learning for large language models by revealing and fixing a hidden flaw in GRPO, offering a low-cost alternative to explicit PRMs for researchers and practitioners.

The paper proves that the GRPO RL algorithm inherently induces a process reward model (PRM) under certain assumptions, identifies a flaw in its objective related to non-uniform process steps, and proposes a modified algorithm (λ-GRPO) that improves validation accuracy and downstream task performance, reaching peak performance more rapidly than standard GRPO.

We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect (λ-GRPO), and show that LLMs trained with λ-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
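The "within-group overlap" idea can be pictured with a toy sketch. This is an illustration only, not the paper's construction or the λ-GRPO update: it assumes the standard group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation; the population standard deviation is used here as an assumption), and the helper names `group_advantages`, `shared_prefix_groups`, and `prefix_mean_advantage` are hypothetical. It groups completions by shared token prefixes and averages the advantages of the completions that share each prefix, which gives a step-level score derived purely from outcome rewards.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumption: population std; implementations may use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shared_prefix_groups(completions):
    """Map every token prefix to the indices of completions that begin with it."""
    groups = defaultdict(list)
    for i, tokens in enumerate(completions):
        for t in range(1, len(tokens) + 1):
            groups[tuple(tokens[:t])].append(i)
    return groups


def prefix_mean_advantage(completions, rewards):
    """Average the advantages of all completions sharing each prefix.

    Illustrative step-level score only; not the paper's exact PRM definition.
    """
    adv = group_advantages(rewards)
    groups = shared_prefix_groups(completions)
    return {prefix: mean([adv[i] for i in idx]) for prefix, idx in groups.items()}


# Toy group of four completions (token lists) with binary outcome rewards.
completions = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["f"]]
rewards = [1.0, 1.0, 0.0, 0.0]
scores = prefix_mean_advantage(completions, rewards)
```

In this toy group, the prefix `("a", "b")` is shared only by the two rewarded completions and scores about 1.0, while `("a",)` is also shared by an unrewarded completion and scores about 0.33. Prefixes shared by different numbers of completions are the kind of non-uniformity the abstract points to.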

