CL

Jun 9, 2025

Reinforcement Pre-Training

Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei

Tsinghua

arXiv:2506.08007v1

27.433 citations

h-index: 11

Originality: Highly original

AI Analysis

This paper proposes a new paradigm for language model pre-training, with potential impact across the fields of large language models and reinforcement learning.

The paper tackles the problem of scaling large language models by introducing Reinforcement Pre-Training (RPT), which reframes next-token prediction as a reasoning task trained with reinforcement learning, using verifiable rewards derived from text data. The results show that RPT significantly improves next-token prediction accuracy, and the scaling curves show consistent accuracy gains as training compute increases.

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of next-token prediction. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
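To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the toy token list, and the exact-match criterion are all assumptions chosen for illustration, and the abstract does not specify how matches are scored or how rewards feed into the RL update. The point it illustrates is only that the ground-truth next token already sits in the corpus, so the reward can be checked without any annotated answers.

```python
from typing import List, Sequence


def next_token_reward(predicted: str, corpus_tokens: Sequence[str], position: int) -> float:
    """Return 1.0 if the prediction equals the corpus token at `position`, else 0.0.

    Exact string match is a simplifying assumption made for this sketch.
    """
    return 1.0 if predicted == corpus_tokens[position] else 0.0


def score_rollouts(predictions: List[str], corpus_tokens: Sequence[str], position: int) -> List[float]:
    """Score several sampled predictions for the same context against the same target."""
    return [next_token_reward(p, corpus_tokens, position) for p in predictions]


# Hypothetical toy corpus: the context is tokens[:2], the target is tokens[2].
tokens = ["The", " cat", " sat", " on", " the", " mat"]

# Hypothetical final answers from four reasoning rollouts on that context.
rollout_predictions = [" sat", " ran", " sat", " slept"]

rewards = score_rollouts(rollout_predictions, tokens, position=2)
mean_reward = sum(rewards) / len(rewards)
print(rewards, mean_reward)
```

In an actual training loop these per-rollout rewards would be passed to an RL algorithm. That step is omitted here because the abstract does not describe it.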
