LG · Oct 29, 2025

Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models

arXiv:2511.00066v1 · 1 citation · h-index: 11
Originality: Incremental advance
AI Analysis

The paper addresses a critical issue for researchers and practitioners who use reinforcement learning to improve reasoning in large language models, though it is an incremental improvement over existing methods.

The paper tackles unstable training in reinforcement learning for large language models, where low-probability tokens dominate gradient updates. It introduces Token-Regulated Group Relative Policy Optimization (TR-GRPO), which mitigates this by weighting each token according to the model's predicted probability for it. The authors report that TR-GRPO consistently outperforms GRPO across tasks including logic, math, and agentic reasoning.

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model's predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
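The gradient imbalance described in the abstract can be illustrated with a small sketch. For a softmax policy, the gradient of log p(token) with respect to the logits is the one-hot vector of the token minus the probability vector, so its norm grows as the token's probability shrinks. The code below is a minimal illustration under stated assumptions, not the paper's implementation: `token_weight`, with its linear form and its `w_min` / `w_max` bounds, is a hypothetical stand-in for "weights positively correlated with the model's predicted probability", and the example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log p(token) / d logits, i.e. of onehot(token) - p."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


def token_weight(prob, w_min=0.5, w_max=1.5):
    """Hypothetical weight that increases linearly with the token probability.

    This is an assumption for illustration only; the paper's actual
    weighting function is not given in the abstract.
    """
    return w_min + (w_max - w_min) * prob


def regulated_grad_norm(logits, token):
    """Gradient norm after scaling by the (assumed) token-level weight."""
    prob = softmax(logits)[token]
    return token_weight(prob) * logprob_grad_norm(logits, token)


# Made-up logits: token 0 is a high-probability token, token 1 a low-probability one.
logits = [4.0, 0.0, 0.0]
high, low = 0, 1

plain_ratio = logprob_grad_norm(logits, low) / logprob_grad_norm(logits, high)
regulated_ratio = regulated_grad_norm(logits, low) / regulated_grad_norm(logits, high)

print(f"unweighted low/high gradient-norm ratio: {plain_ratio:.2f}")
print(f"weighted low/high gradient-norm ratio:   {regulated_ratio:.2f}")
```

In this toy setting the low-probability token still has the larger gradient after weighting, but the gap narrows, which is the qualitative effect the abstract attributes to downweighting low-probability tokens. A full training loss would also include the group-relative advantages and the clipping terms of GRPO, which are omitted here.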

