CL | Dec 18, 2025

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu

arXiv:2512.16649v1 | 20 citations | h-index: 32

Originality: Highly original

AI Analysis

This work addresses the issue of overcomplicated RL pipelines for scaling LLMs, offering a simpler, more efficient baseline for researchers and practitioners.

The paper asks whether the growing complexity of reinforcement learning pipelines for large language models is necessary. It presents JustRL, a minimal single-stage training approach with fixed hyperparameters, which achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using half the compute of more complex methods.

Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
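
To make the "explicit length penalty" mentioned in the abstract concrete, here is a minimal sketch of how such a penalty is commonly layered on top of a binary correctness reward. This is an illustration only and is not taken from the JustRL code: `RewardConfig`, `binary_reward`, `shaped_reward`, the 1024-token budget, and the linear penalty form are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Hypothetical knobs for this illustration; not values from the paper."""
    max_tokens: int = 1024     # assumed response-length budget
    length_coef: float = 0.0   # 0.0 disables the length penalty entirely


def binary_reward(is_correct: bool) -> float:
    """Plain verifier-style reward: 1 for a correct final answer, else 0."""
    return 1.0 if is_correct else 0.0


def shaped_reward(is_correct: bool, num_tokens: int, cfg: RewardConfig) -> float:
    """Binary reward minus a linear penalty on tokens beyond the budget."""
    overflow = max(0, num_tokens - cfg.max_tokens)
    penalty = cfg.length_coef * overflow / cfg.max_tokens
    return binary_reward(is_correct) - penalty


if __name__ == "__main__":
    plain = RewardConfig()                     # no penalty (the "minimal" setting)
    penalized = RewardConfig(length_coef=0.5)  # one kind of "standard trick"
    print(shaped_reward(True, 2048, plain))
    print(shaped_reward(True, 2048, penalized))
```

With `length_coef` set to zero, the shaped reward reduces to the plain binary reward. With a positive coefficient, a long but correct response earns less than a short correct one, which is the kind of pressure the abstract suggests may collapse exploration.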
