MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
This work targets the high training cost that researchers and practitioners face when using GRPO-style methods for mathematical reasoning; it is an incremental improvement in training efficiency.
The paper tackles the computational expense of Group Relative Policy Optimization (GRPO) training for mathematical reasoning models. It proposes MMR-GRPO, which uses Maximal Marginal Relevance (MMR) to reweight rewards based on completion diversity. On average, this reduces training steps by 47.9% and wall-clock time by 70.2% while maintaining comparable peak performance.
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance (MMR) to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
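To make the idea of diversity-aware reward reweighting concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact formula, so everything specific here is an assumption: the function names `cosine`, `mmr_reweight`, and `group_advantages`, the trade-off parameter `lam`, the use of cosine similarity over precomputed completion embeddings, and the choice to use the classic MMR score (relevance minus redundancy) directly as the adjusted reward. The group-relative normalization at the end follows the usual GRPO pattern of standardizing rewards within a prompt's group of completions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one group of completions.

    ASSUMPTION: the adjusted reward of a completion is its MMR score,
    lam * reward - (1 - lam) * (max similarity to completions already picked).
    The paper may define the reweighting differently.
    """
    remaining = set(range(len(rewards)))
    selected = []
    adjusted = [0.0] * len(rewards)
    while remaining:
        best, best_score = None, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,  # nothing picked yet, so no redundancy penalty
            )
            score = lam * rewards[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        adjusted[best] = best_score
        selected.append(best)
        remaining.remove(best)
    return adjusted


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


# Toy group of four completions with made-up 2-D embeddings:
# completions 0 and 1 are near-duplicates and both correct,
# completion 2 is correct but different, completion 3 is incorrect.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1, 0], [1, 0], [0, 1], [0, 1]]

adjusted = mmr_reweight(rewards, embeddings)
advantages = group_advantages(adjusted)
```

In this toy group, the duplicate correct completion (index 1) ends up with a lower adjusted reward than the first correct completion and the distinct correct one, which illustrates the stated intuition that redundant completions should contribute less learning signal. In practice the embeddings would come from a sentence encoder, which is outside the scope of this sketch.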