AI | LG | May 23, 2025

Stable Reinforcement Learning for Efficient Reasoning

arXiv:2505.18086v1 | 29 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses efficiency and stability issues in reinforcement learning for large language models, particularly on reasoning tasks. The advance is incremental because it builds on existing GRPO methods.

The paper tackles training instability in reinforcement learning for reasoning tasks, where length-penalty reward functions cause accuracy to collapse. The proposed method avoids this instability while improving average accuracy by 1.48% and reducing reasoning sequence length by 47.3% across multiple benchmarks.

The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as the completion length decreases, model accuracy abruptly collapses, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-$\lambda$, an efficient and stabilized variant of GRPO that dynamically adjusts the reward strategy by monitoring the correctness ratio among completions within each query-sampled group. A low correctness ratio indicates the need to avoid a length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids the training instability caused by length penalties while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
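The reward switch described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `ratio_threshold` and `penalty_weight` parameters, their default values, and the linear form of the length penalty are all assumptions made for illustration. The group-normalized advantage follows the usual GRPO recipe (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the epsilon value are also assumptions.

```python
from statistics import mean, pstdev


def group_rewards(correct, lengths, ratio_threshold=0.5, penalty_weight=0.5):
    """Return one reward per completion in a query-sampled group.

    correct: list of bools, whether each completion's final answer is right.
    lengths: list of ints, token length of each completion.
    ratio_threshold, penalty_weight: hypothetical hyperparameters.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    base = [1.0 if c else 0.0 for c in correct]
    correctness_ratio = sum(base) / len(base)

    # Low correctness ratio: fall back to length-agnostic 0/1 rewards.
    if correctness_ratio < ratio_threshold:
        return base

    # High correctness ratio: keep a length penalty (assumed linear form),
    # so shorter correct completions score higher than longer correct ones.
    max_len = max(lengths)
    return [b * (1.0 - penalty_weight * n / max_len) for b, n in zip(base, lengths)]


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Mostly wrong group: the length penalty is switched off.
    print(group_rewards([True, False, False, False], [100, 200, 300, 400]))
    # Mostly correct group: shorter correct answers earn more.
    print(group_rewards([True, True, True, False], [100, 200, 400, 400]))
```

In this sketch the decision is made per group, so a hard query whose samples are mostly wrong is trained with plain 0/1 rewards even while easier queries in the same batch still receive the length penalty.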
