CL
Jun 9, 2025

Reinforcement Pre-Training

Tsinghua
arXiv:2506.08007v1
33 citations
h-index: 11
Originality: Highly original
AI Analysis

This paper proposes a new paradigm for language model pre-training, with potential impact across large language model and reinforcement learning research.

The paper tackles the problem of scaling large language models by introducing Reinforcement Pre-Training (RPT), which reframes next-token prediction as a reasoning task trained with reinforcement learning using verifiable rewards from text data. The results show that RPT significantly improves language modeling accuracy for next-token prediction, with scaling curves demonstrating consistent accuracy gains from increased training compute.

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next token. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
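The verifiable reward described in the abstract can be illustrated with a minimal sketch. It assumes an exact-match rule (reward 1 when the predicted token equals the token that actually follows in the corpus, 0 otherwise) and a hypothetical `policy` callable standing in for a model that reasons and then emits one prediction. The names `next_token_reward`, `score_rollouts`, and `toy_policy` are illustrative and do not come from the paper, whose exact reward definition and RL algorithm may differ.

```python
from typing import Callable, List, Sequence


def next_token_reward(predicted: str, actual: str) -> float:
    """Assumed exact-match reward: 1.0 for a correct next token, else 0.0."""
    return 1.0 if predicted == actual else 0.0


def score_rollouts(
    context: Sequence[str],
    actual_next: str,
    policy: Callable[[Sequence[str]], str],
    num_rollouts: int = 4,
) -> List[float]:
    """Sample several predictions for one context and score each of them.

    The ground-truth token comes straight from the text corpus, so no
    human-annotated answer is needed to compute the reward.
    """
    return [
        next_token_reward(policy(context), actual_next)
        for _ in range(num_rollouts)
    ]


def toy_policy(context: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: a fixed rule, not a real LLM."""
    return "mat" if context and context[-1] == "the" else "<unk>"


context = ["the", "cat", "sat", "on", "the"]
rewards = score_rollouts(context, actual_next="mat", policy=toy_policy)
print(rewards)
```

In an actual training loop these per-rollout rewards would feed a policy-gradient update; that step is omitted here because the abstract does not specify it.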

