LG · CL · Mar 31, 2025

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

arXiv:2503.24290v2 · 447 citations · h-index: 32 · Has Code
Originality: Incremental advance
AI Analysis

This work aims to make large-scale reasoning RL more accessible and efficient for the AI research community. It is an incremental advance, since it builds on existing methods such as DeepSeek-R1-Zero.

The paper tackles scaling up reinforcement learning for reasoning tasks on base models by introducing Open-Reasoner-Zero, an open-source implementation that takes a minimalist approach: vanilla PPO with rule-based rewards. Starting from the same Qwen2.5-32B base model, it outperforms DeepSeek-R1-Zero-Qwen-32B on AIME2024, MATH500, and GPQA Diamond while requiring only 1/10 of the training steps of the DeepSeek-R1-Zero pipeline.
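To make "rule-based rewards" concrete, here is a minimal sketch of what such a reward could look like. It assumes the model is asked to put its final answer in `\boxed{...}` and that correctness is an exact string match against a reference. The function names and the matching rule are illustrative assumptions, not the paper's actual reward code.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper: only handles answers without nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0.

    Assumption: exact string match after stripping whitespace.
    """
    answer = extract_boxed(response)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0
```

A reward of this kind needs no learned reward model, which is part of what keeps the recipe simple. A real system would likely need more tolerant answer matching (for example, treating equivalent numeric forms as equal).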

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($λ=1$, $γ=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
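The GAE setting mentioned in the abstract ($λ=1$, $γ=1$) has a simple consequence: each token's advantage reduces to the undiscounted return from that token onward minus the critic's value estimate at that token. The sketch below illustrates this with the standard GAE recursion. The function name, the toy reward and value numbers, and the convention that the value after the final token is zero are assumptions made for illustration, not details taken from the released code.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> List[float]:
    """Standard GAE over one finished episode.

    rewards[t] and values[t] are per-token; the value after the last
    token is assumed to be 0 (terminal state).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # V(s_{T}) for the terminal state
    running = 0.0      # accumulated discounted sum of TD errors
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Toy episode: a single terminal reward, as with a rule-based correctness check.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.5, 0.9]   # made-up critic outputs
adv = gae_advantages(rewards, values)

# With gamma = lam = 1, adv[t] equals (sum of rewards from t onward) - values[t].
returns_minus_values = [sum(rewards[t:]) - values[t] for t in range(len(rewards))]
```

Under these settings the TD errors telescope, so the critic acts purely as a per-token baseline for the final outcome reward. This is consistent with the abstract's observation that a critic assigning low value to repetitive patterns shapes the advantage estimates directly.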

Code Implementations: 1 repo