PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
This work addresses the challenge of aligning process and outcome rewards for more efficient policy optimization in language models, representing an incremental improvement over existing critic-free methods.
The paper tackles sparse reward signals in policy optimization for large language models on multi-step reasoning tasks. It introduces Process Relative Policy Optimization (PRPO), which combines outcome and process rewards to improve fine-grained credit assignment. With Qwen2.5-Math-1.5B and only eight rollouts, accuracy on MATH500 rises from 61.2% to 64.4%.
Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning. While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through a location-parameter shift. On MATH500, PRPO raises Qwen2.5-Math-1.5B accuracy over GRPO from 61.2% to 64.4%, using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization. Code is available at: https://github.com/SchumiDing/srpocode
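
The pipeline described in the abstract (segment, normalize PRM scores, shift toward the outcome advantage) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and two details are assumptions made for illustration, namely that PRM scores are standardized within a single rollout and that the location-parameter shift adds the rollout's GRPO-style outcome advantage to every segment. Segmentation itself is taken as given, with each segment supplied as a PRM score and a token count.

```python
import numpy as np

def grpo_outcome_advantages(outcome_rewards, eps=1e-8):
    """Group-normalized outcome advantage, one scalar per rollout (GRPO-style)."""
    r = np.asarray(outcome_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def prpo_token_advantages(segment_scores, segment_lengths, outcome_advantage, eps=1e-8):
    """Token-level advantages for one rollout.

    Assumption: PRM scores are standardized across the rollout's segments,
    then shifted so the segment-level mean equals the outcome advantage.
    """
    s = np.asarray(segment_scores, dtype=float)
    z = (s - s.mean()) / (s.std() + eps)   # zero-mean process signal
    shifted = z + outcome_advantage        # location shift toward the outcome
    # Every token in a segment shares that segment's advantage.
    return np.repeat(shifted, segment_lengths)

# Illustrative values only: four rollouts, the first one split into three segments.
outcome_adv = grpo_outcome_advantages([1.0, 0.0, 0.0, 1.0])
token_adv = prpo_token_advantages(
    segment_scores=[0.9, 0.5, 0.1],
    segment_lengths=[2, 3, 1],
    outcome_advantage=outcome_adv[0],
)
print(token_adv)
```

Under these assumptions, tokens in a correct rollout still receive different advantages depending on how the PRM rated their segment, while the segment-level average stays anchored to the outcome signal.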