From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight
This paper addresses a critical stability issue in RLVR optimization for large language models, though it appears to be an incremental improvement over existing soft clipping methods.
The paper tackles the problem of gradient divergence in soft clipping methods for reinforcement learning with verifiable rewards (RLVR) by proposing Decoupled Gradient Policy Optimization (DGPO), which uses probability gradients instead of log-probability gradients and applies a decoupled decay mechanism. Experiments on DeepSeek-R1-Distill-Qwen models (1.5B/7B/14B) show DGPO consistently outperforms baselines on mathematical benchmarks.
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via ``hard clipping'', which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent ``soft clipping'' methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing the probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose-Juri/DGPO-RL.
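The divergence mentioned above follows from the chain rule: $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$, so a log-probability gradient is a probability gradient scaled by $1/\pi_\theta$, which grows without bound as $\pi_\theta \to 0$. The sketch below illustrates this weight alongside the gradient-discarding behavior of hard clipping. It is an illustration only and not the DGPO decay rule, which the abstract does not specify. The function names, the default `eps=0.2`, and the use of the standard PPO-style clipped surrogate as a stand-in for GRPO's hard clipping are all assumptions made for this example.

```python
def hard_clip_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative w.r.t. `ratio` of the PPO-style clipped surrogate
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    Hypothetical helper; assumes the standard clipped objective.
    """
    # The clipped branch is selected (and is constant in `ratio`) exactly
    # when the ratio has moved past the boundary in the direction the
    # advantage favors, so the gradient for that token is discarded.
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return advantage


def weight_on_prob_grad(prob):
    """Factor multiplying grad(pi) inside grad(log pi) = grad(pi) / pi.

    Hypothetical helper; `prob` is the token probability pi, assumed > 0.
    """
    return 1.0 / prob


if __name__ == "__main__":
    # Tokens outside the trust region contribute no gradient under hard clipping.
    for ratio in (0.7, 1.0, 1.3):
        print("ratio", ratio, "grad (A=+1)", hard_clip_grad_wrt_ratio(ratio, 1.0))

    # The implicit weight on the probability gradient grows as pi shrinks.
    for prob in (1e-1, 1e-3, 1e-6):
        print("pi", prob, "weight 1/pi", weight_on_prob_grad(prob))
```

A soft clipping scheme that keeps a nonzero coefficient on $\nabla_\theta \log \pi_\theta$ for out-of-region tokens inherits the $1/\pi_\theta$ factor shown by `weight_on_prob_grad`, which is the instability the abstract attributes to such methods.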