LG | AI | Oct 27, 2025

Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

arXiv:2510.23049v2 | 4 citations | h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work provides a unifying perspective for RL researchers, but it is incremental as it clarifies existing methods rather than introducing new ones.

The paper reconciles two distinct policy gradient approaches to the Pass@K objective in reinforcement learning with verifiable rewards (RLVR). It shows that the two are equivalent: advantage-shaping methods implicitly optimize surrogate rewards, which places both approaches in a single framework.

This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
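
For orientation, here is a minimal sketch of the two baseline quantities the abstract refers to. Neither function is taken from the paper. `pass_at_k` uses the standard combinatorial estimator of Pass@K from n sampled responses of which c are verified correct, and `grpo_advantages` is the usual group-normalized advantage (reward minus group mean, divided by group standard deviation) that advantage-shaping methods then modify. The function names and the `eps` stabilizer are illustrative assumptions.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (r - mean) / (std + eps) within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four sampled responses, only the first verified correct.
    rewards = [1.0, 0.0, 0.0, 0.0]
    print(pass_at_k(n=4, c=1, k=2))
    print(grpo_advantages(rewards))
```

On a hard prompt such as this one (one success in four), the single correct response receives a large positive advantage. Schemes that up-weight hard examples rescale such advantages further, which the note interprets as regularization at the reward level.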

