CL | Dec 3, 2025

PretrainZero: Reinforcement Active Pretraining

arXiv:2512.03442v1 | 5 citations | h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses a problem relevant to AI researchers: extending reinforcement learning to general reasoning without verifiable labels. It appears incremental, however, because it builds on existing pretraining and active learning concepts.

The paper tackles the bottleneck of reinforcement learning models relying on verifiable rewards in specific domains by proposing PretrainZero, a reinforcement active learning framework for general pretraining that improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math average benchmarks.

Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, which creates a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus and to reason to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
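
To make the masked-span idea in the abstract concrete, here is a minimal sketch of how a span could be hidden in corpus text and a prediction scored without external labels. It is an illustration under stated assumptions, not the paper's implementation: the `[MASK]` placeholder, the helper names `mask_span`, `normalize`, and `span_reward`, and the binary exact-match reward are all hypothetical, and the learned mask-selection policy and the RL update itself are omitted.

```python
import re
from typing import Tuple

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_span(text: str, start: int, end: int) -> Tuple[str, str]:
    """Hide text[start:end] and return (masked_text, hidden_span)."""
    if not 0 <= start < end <= len(text):
        raise ValueError("span must be non-empty and lie inside the text")
    return text[:start] + MASK + text[end:], text[start:end]


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", s.strip().lower())


def span_reward(prediction: str, target: str) -> float:
    """Assumed binary reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


# The corpus text itself supplies the label, so no annotation is needed.
sentence = "Paris is the capital of France."
masked, target = mask_span(sentence, 0, 5)
print(masked)                          # [MASK] is the capital of France.
print(span_reward(" paris ", target))  # 1.0
print(span_reward("London", target))   # 0.0
```

In this sketch the span boundaries are passed in by hand. In the framework described above, choosing which spans to mask is itself part of the learned policy, which this example does not attempt to model.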
