CL · Mar 21, 2025

FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models

Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, Feng Zhang

arXiv:2503.17287v621.314 citations · h-index: 7 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses training efficiency for developers of large language models, offering an incremental improvement through curriculum reinforcement learning with context scaling.

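The paper tackles training efficiency in large-scale reinforcement learning for reasoning models by controlling context length and curating training data according to input prompt length. The resulting framework, FastCuRL, applies stage-wise context scaling: FastCuRL-1.5B-V3 outperforms state-of-the-art reasoning models on five competition-level benchmarks and reaches 49.6% accuracy on AIME 2024, while FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five benchmarks using a single node with 8 GPUs and 50% of the training steps.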

Improving training efficiency continues to be one of the primary challenges in large-scale Reinforcement Learning (RL). In this paper, we investigate how context length and the complexity of training data influence the RL scaling training process of R1-distilled reasoning models, e.g., DeepSeek-R1-Distill-Qwen-1.5B. Our experimental results reveal that: (1) simply controlling the context length and curating the training data based on the input prompt length can effectively improve the training efficiency of RL scaling, achieving better performance with more concise CoT; (2) properly scaling the context length helps mitigate entropy collapse; and (3) carefully choosing the context length facilitates achieving efficient LLM training and reasoning. Inspired by these insights, we propose FastCuRL, a curriculum RL framework with stage-wise context scaling to achieve efficient LLM training and reasoning. Extensive experimental results demonstrate that FastCuRL-1.5B-V3 significantly outperforms state-of-the-art reasoning models on five competition-level benchmarks and achieves 49.6% accuracy on AIME 2024. Furthermore, FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five benchmarks while only using a single node with 8 GPUs and a total of 50% of training steps.
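As a rough illustration of the idea in the abstract, the sketch below splits a prompt set by input length and pairs each bucket with a context limit that changes from stage to stage. It is a minimal sketch, not the authors' implementation: the `Stage` class, the median split, the stage order, and the context sizes (8192, 16384, 24576) are all hypothetical, and character count stands in for token count.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical structure, not from the paper)."""
    name: str
    subset: str        # prompt-length bucket to train on: "short", "long", or "all"
    max_context: int   # context limit for this stage (illustrative value)


def split_by_prompt_length(prompts, length_fn=len):
    """Split prompts into short/long buckets around the median prompt length.

    length_fn defaults to character count; a real setup would likely
    count tokens with the model's tokenizer instead.
    """
    cutoff = median(length_fn(p) for p in prompts)
    short = [p for p in prompts if length_fn(p) <= cutoff]
    long_ = [p for p in prompts if length_fn(p) > cutoff]
    return {"short": short, "long": long_, "all": list(prompts)}


def build_curriculum(prompts, stages, length_fn=len):
    """Return (stage name, context limit, training prompts) for each stage."""
    buckets = split_by_prompt_length(prompts, length_fn)
    return [(s.name, s.max_context, buckets[s.subset]) for s in stages]


# Hypothetical schedule: the ordering and sizes are assumptions for illustration.
STAGES = [
    Stage("stage1", "short", 8192),
    Stage("stage2", "all", 16384),
    Stage("stage3", "long", 24576),
]

if __name__ == "__main__":
    demo_prompts = ["a", "bbb", "ccccc", "ddddddd"]
    for name, ctx, data in build_curriculum(demo_prompts, STAGES):
        print(name, ctx, len(data))
```

Each tuple returned by `build_curriculum` would be handed to an RL training loop that caps generation at the stage's context limit; that loop is outside the scope of this sketch.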
