On the Plasticity and Stability for Post-Training Large Language Models
This work addresses training instability in GRPO-based post-training of large language models, a problem relevant to both researchers and practitioners. It offers an incremental improvement by refining how conflicting gradients are handled in GRPO.
The paper tackles the training stability bottleneck in Group Relative Policy Optimization (GRPO). It attributes the instability to a geometric conflict between plasticity and stability gradients, which produces destructive interference. It proposes Probabilistic Conflict Resolution (PCR), a Bayesian framework that uses an uncertainty-aware soft projection to arbitrate these conflicts dynamically. The authors report markedly smoother training trajectories and superior performance on reasoning tasks.
Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We trace a root cause to the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO because they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware ``soft projection'' mechanism that optimizes the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance across various reasoning tasks.
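To make the contrast between deterministic and uncertainty-aware projection concrete, the following is a minimal illustrative sketch and not the PCR algorithm itself, whose equations the abstract does not give. It assumes that a conflict means a negative inner product between the plasticity and stability gradients, that the deterministic baseline removes the conflicting component entirely, and that a "soft" variant scales that removal by an estimated probability of conflict under a Gaussian approximation. All function names and the blending rule are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def hard_project(g_plastic, g_stable):
    """Deterministic projection: if the two gradients conflict (negative
    inner product), remove the component of g_plastic along g_stable."""
    d = dot(g_plastic, g_stable)
    if d >= 0:
        return list(g_plastic)
    scale = d / dot(g_stable, g_stable)
    return [p - scale * s for p, s in zip(g_plastic, g_stable)]


def conflict_probability(samples_plastic, g_stable):
    """Assumed estimator: probability that the mean inner product is
    negative, using a Gaussian approximation over per-sample inner
    products. Requires at least two samples."""
    dots = [dot(g, g_stable) for g in samples_plastic]
    n = len(dots)
    mean = sum(dots) / n
    var = sum((d - mean) ** 2 for d in dots) / (n - 1)
    std_err = math.sqrt(var / n)
    if std_err == 0.0:
        return 1.0 if mean < 0 else 0.0
    z = -mean / std_err
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def soft_project(samples_plastic, g_stable):
    """Hypothetical soft projection: blend between the raw mean gradient
    (weight 1 - p) and its full projection (weight p), where p is the
    estimated conflict probability. Returns (gradient, p)."""
    n = len(samples_plastic)
    g_mean = [sum(col) / n for col in zip(*samples_plastic)]
    p = conflict_probability(samples_plastic, g_stable)
    scale = p * dot(g_mean, g_stable) / dot(g_stable, g_stable)
    return [a - scale * b for a, b in zip(g_mean, g_stable)], p


g_stable = [0.0, 1.0]
# Per-sample plasticity gradients that consistently oppose g_stable.
confident = [[1.0, -1.0], [1.0, -1.2], [1.0, -0.8]]
# Per-sample plasticity gradients whose direction along g_stable is noisy.
noisy = [[1.0, -3.0], [1.0, 2.0], [1.0, 0.7]]

g_conf, p_conf = soft_project(confident, g_stable)
g_noisy, p_noisy = soft_project(noisy, g_stable)
```

With the consistent samples, the estimated conflict probability is close to 1 and the soft projection behaves almost like the deterministic one. With the noisy samples, the mean inner product is only slightly negative relative to its spread, so only part of the conflicting component is removed. Note that under this blending rule an aligned component is also slightly reduced when the estimated conflict probability is nonzero, which is a property of this sketch rather than a claim about PCR.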