CL · Oct 10, 2025

DARO: Difficulty-Aware Reweighting Policy Optimization

arXiv:2510.09001v1 · 4 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in RLVR methods for LLMs, offering an incremental improvement over existing approaches like GRPO.

The paper argues that static weighting schemes in reinforcement learning for large language models create a loss scale issue that hinders performance. It introduces DARO, a method that dynamically adjusts each difficulty group's loss contribution, and reports faster convergence and better final performance on math benchmarks.

Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
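
To make the "weighted variations of GRPO" framing concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes binary verifiable rewards, uses the empirical pass rate of a prompt's rollouts as a difficulty proxy, and the names `pass_rate`, `difficulty_weighted_loss`, and the example weight `p * (1 - p)` are hypothetical illustrations of a static weighting scheme, not DARO's actual rule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_rate(rewards):
    """Difficulty proxy (assumption): fraction of rollouts with reward 1."""
    return sum(rewards) / len(rewards)


def difficulty_weighted_loss(prompt_losses, prompt_rewards, weight_fn):
    """Combine per-prompt losses with weights that depend only on pass rate.

    weight_fn is a hypothetical static scheme; a fixed function like this
    cannot adapt as the model's pass rates shift during training.
    """
    weights = [weight_fn(pass_rate(r)) for r in prompt_rewards]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, prompt_losses)) / total


# One hard prompt (1 of 4 rollouts correct) and one medium prompt (2 of 4).
rewards = [[1, 0, 0, 0], [1, 1, 0, 0]]
losses = [2.0, 1.0]  # made-up per-prompt loss values for illustration

uniform = difficulty_weighted_loss(losses, rewards, lambda p: 1.0)
static = difficulty_weighted_loss(losses, rewards, lambda p: p * (1 - p))
```

With uniform weights the result is the plain mean of the per-prompt losses, while the static `p * (1 - p)` weight shifts the loss scale toward medium-difficulty prompts, which is the kind of fixed emphasis the abstract describes as failing to track a model's evolving capabilities.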

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
