DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
For practitioners applying reinforcement learning to large language models, DARE goes beyond difficulty-aware data selection: it allocates compute adaptively and applies training strategies tailored to each difficulty tier, yielding more concise responses on easy tasks and improved correctness on hard ones.
DARE introduces a unified reinforcement learning framework that co-evolves difficulty estimation with the policy, achieving consistent improvements in training efficiency, final performance, and inference efficiency across multiple models and domains.
Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose **Dare**, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that **Dare** consistently outperforms existing methods in training efficiency, final performance, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at https://github.com/EtaYang10th/DARE.
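
To make two of the ingredients named above more concrete, the following is a minimal sketch, not the released implementation. It assumes that a prompt's difficulty is tracked through its pass rate, that stale rollouts carry sequence log-probabilities under both the policy that generated them and the current policy, and that prompts are picked by matching stored pass rates to targets drawn from a symmetric Beta distribution. The function names `snis_pass_rate` and `sample_prompts`, the parameter `alpha`, and the nearest-match selection rule are all hypothetical choices made for illustration.

```python
import math
import random


def snis_pass_rate(rewards, logp_new, logp_old):
    """Self-normalized importance sampling estimate of a prompt's pass rate.

    rewards  : 0/1 correctness of rollouts drawn from an older policy
    logp_new : sequence log-probabilities under the current policy
    logp_old : sequence log-probabilities under the policy that sampled them
    (All three inputs are assumptions of this sketch.)
    """
    log_w = [n - o for n, o in zip(logp_new, logp_old)]
    shift = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    # Dividing by the weight sum is what makes the estimator self-normalized.
    return sum(w * r for w, r in zip(weights, rewards)) / total


def sample_prompts(pass_rates, k, alpha=2.0, rng=None):
    """Pick k prompts whose estimated pass rates follow Beta(alpha, alpha).

    pass_rates maps a prompt id to its estimated pass rate in [0, 1].
    Each draw selects the unused prompt closest to a sampled target
    (a hypothetical matching rule, chosen only for simplicity).
    """
    rng = rng or random.Random(0)
    remaining = dict(pass_rates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        target = rng.betavariate(alpha, alpha)  # symmetric around 0.5
        pid = min(remaining, key=lambda p: abs(remaining[p] - target))
        chosen.append(pid)
        del remaining[pid]
    return chosen
```

When the current and sampling policies agree, all weights are equal and the estimate reduces to the plain mean reward. A symmetric Beta with `alpha` above 1 concentrates on moderately difficult prompts while still giving nonzero mass to very easy and very hard ones, which is one way to read "diverse difficulty coverage".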