CL | Apr 13, 2025

GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

arXiv:2504.09696v234.487 citations | h-index: 3 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of concise and accurate mathematical reasoning in language models and represents an incremental improvement over existing GRPO methods.

The paper tackles three weaknesses of Group Relative Policy Optimization (GRPO) for mathematical reasoning in language models: reward sparsity, verbosity, and inadequate focus on problem difficulty. The resulting method, GRPO-LEAD, is reported to significantly improve reasoning accuracy, conciseness, and efficiency, and to achieve state-of-the-art performance for 14B-scale models.

Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
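The three components named in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so everything specific here is an assumption made for illustration: the exponential length penalty, the fixed penalty value for incorrect answers, the linear difficulty weight, and the hypothetical names `lead_style_advantages`, `difficulty_weight`, `alpha`, `wrong_penalty`, and `k`. The authors' actual implementation is in their repository and may differ.

```python
import math
from statistics import mean, pstdev


def difficulty_weight(pass_rate, k=2.0):
    """Assumed linear weight: harder prompts (lower pass rate) get a larger weight."""
    return 1.0 + k * (1.0 - pass_rate)


def lead_style_advantages(correct, lengths, alpha=0.1, wrong_penalty=-1.0, k=2.0):
    """Sketch of per-response advantages for one prompt's group of samples.

    correct: list of bools, whether each sampled response is right
    lengths: list of token counts, one per response
    All functional forms below are illustrative assumptions.
    """
    n = len(correct)
    if n == 0 or n != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in length")

    # (1) Length-regularized reward: correct answers are scored by how their
    # length compares with the other correct answers in the same group.
    good = [l for c, l in zip(correct, lengths) if c]
    mu = mean(good) if good else 0.0
    sigma = pstdev(good) if len(good) > 1 else 0.0

    rewards = []
    for c, l in zip(correct, lengths):
        if c:
            z = (l - mu) / sigma if sigma > 0 else 0.0
            rewards.append(math.exp(-alpha * z))  # shorter than average => reward > 1
        else:
            # (2) Explicit penalty for incorrect solutions (assumed constant).
            rewards.append(wrong_penalty)

    # Standard GRPO step: normalize rewards within the group.
    r_mu, r_sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - r_mu) / (r_sigma + 1e-8) for r in rewards]

    # (3) Difficulty-aware reweighting: scale by the group's empirical pass rate.
    w = difficulty_weight(len(good) / n, k)
    return [w * a for a in advantages]
```

In this sketch, a group with two correct answers of different lengths and two incorrect answers yields the largest advantage for the shorter correct answer, a smaller one for the longer correct answer, and equal negative advantages for the incorrect ones. The whole group is scaled up when few samples are correct.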
