LG · AI · CL · Mar 26, 2025

Understanding R1-Zero-Like Training: A Critical Perspective

arXiv:2503.20783v2 · 1060 citations · h-index: 41 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses optimization inefficiencies in RL training for LLMs, offering a more efficient method for enhancing reasoning capabilities.

The paper critically analyzes R1-Zero-like training, identifying biases in base models and RL optimization, and proposes Dr. GRPO to improve token efficiency, achieving 43.3% accuracy on AIME 2024 with a 7B model.

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
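
The length bias mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see their repository for that). It assumes, as we understand the paper, that GRPO scales each response's advantage by the group standard deviation and averages the loss over that response's tokens, while Dr. GRPO drops both normalizations. The function names, the toy rewards and lengths, and the use of the population standard deviation are illustrative assumptions. The policy ratio and clipping terms are omitted.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center by the group mean, scale by the group std.

    Assumption: population std; real implementations may use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage (as we understand it): center only."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_weight(advantage, length, length_normalize):
    """Gradient weight that each token of one response receives.

    With length normalization the response-level advantage is spread
    over `length` tokens; without it every token gets the full advantage.
    """
    return advantage / length if length_normalize else advantage


# Hypothetical group of four sampled responses: one correct, three wrong.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [200, 100, 400, 800]  # token counts, made up for illustration

adv = dr_grpo_advantages(rewards)

# Two wrong answers with the same advantage but different lengths.
short_wrong_norm = per_token_weight(adv[1], lengths[1], length_normalize=True)
long_wrong_norm = per_token_weight(adv[3], lengths[3], length_normalize=True)
short_wrong_flat = per_token_weight(adv[1], lengths[1], length_normalize=False)
long_wrong_flat = per_token_weight(adv[3], lengths[3], length_normalize=False)

print("with length normalization:   ", short_wrong_norm, long_wrong_norm)
print("without length normalization:", short_wrong_flat, long_wrong_flat)
```

Under length normalization the 800-token wrong answer is penalized less per token than the 100-token one, which is one way to read the abstract's claim that GRPO inflates the length of incorrect outputs. Without the normalization both receive the same per-token weight.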

Code Implementations: 1 repo
