LG · Oct 29, 2025

Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models

Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le

arXiv:2511.00066v111.41 citations · h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses a critical issue for researchers and practitioners who use reinforcement learning to improve reasoning in large language models, though it is an incremental improvement over existing methods.

The paper tackles unstable training in reinforcement learning for large language models, where low-probability tokens dominate gradient updates. It introduces Token-Regulated Group Relative Policy Optimization (TR-GRPO), which weights each token according to the model's predicted probability for it, and reports consistent gains over GRPO on logic, math, and agentic reasoning tasks.

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model's predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
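The mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting function, so everything below is an assumption for illustration only: the linear weight `token_weight`, its bounds `w_min` and `w_max`, and the helper names are hypothetical and are not taken from the paper. The sketch shows (a) why a low-probability token has a larger log-probability gradient with respect to the logits under a softmax policy, and (b) how a weight that increases with token probability rescales each token's contribution to a GRPO-style update.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def log_prob_grad_norm(probs, idx):
    """L2 norm of d log pi(idx) / d logits for a softmax policy.

    The gradient is one_hot(idx) - probs, so its norm grows as probs[idx] shrinks.
    """
    return math.sqrt(
        sum(((1.0 if k == idx else 0.0) - p) ** 2 for k, p in enumerate(probs))
    )


def token_weight(p, w_min=0.5, w_max=1.5):
    """Hypothetical token weight, linear and increasing in the token probability p.

    Assumption: the paper only states that weights are positively correlated
    with the predicted probability; this linear form is a stand-in.
    """
    return w_min + (w_max - w_min) * p


def token_gradient_coefficients(token_probs, advantage, w_min=0.5, w_max=1.5):
    """Per-token scale applied to the gradient of log pi(token).

    Assumption: the weight is treated as a constant (no gradient flows through it),
    and the clipping and importance ratio of the full objective are omitted.
    """
    return [token_weight(p, w_min, w_max) * advantage for p in token_probs]


if __name__ == "__main__":
    probs = [0.9, 0.05, 0.05]
    print("grad norm, high-prob token:", log_prob_grad_norm(probs, 0))
    print("grad norm, low-prob token: ", log_prob_grad_norm(probs, 1))

    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print("coefficients:", token_gradient_coefficients([0.9, 0.05], advantages[0]))
```

With unweighted GRPO every token in a response shares the same advantage, so the low-probability token's larger gradient norm dominates the update. In this sketch the probability-dependent weight shrinks that token's coefficient relative to the high-probability one, which is the qualitative effect the abstract attributes to TR-GRPO.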
