LG AI CL, Mar 24, 2025

Reasoning to Learn from Latent Thoughts

Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto

DeepMind, U of Toronto

arXiv:2503.18866v233.136 citations, h-index: 30

Originality: Incremental advance

AI Analysis

This work addresses data constraints in scaling language models, particularly for math tasks, but it is an incremental advance because it builds on existing synthetic-data and EM methods.

The paper tackles data scarcity in language model pretraining by proposing to infer the latent thoughts underlying text generation, which improves data efficiency. The authors show that a 1B LM can bootstrap its performance over multiple EM iterations, significantly outperforming baselines trained on raw data, with further gains from increased inference compute.

Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process and posits that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency over training on the same amount of raw data. Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
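The bootstrapping loop described in the abstract alternates between inferring latents with the current model (E-step) and retraining on the latent-augmented data (M-step). The sketch below illustrates only that alternation on a toy problem and is not the paper's procedure: a two-component Gaussian mixture stands in for the LM, a component index stands in for a latent thought, and picking the best of `num_samples` random candidates is an assumed stand-in for spending more inference compute in the E-step. All function names and parameters here are hypothetical.

```python
import math
import random


def log_lik(x, mean):
    # Log density of x under a unit-variance Gaussian centered at `mean`.
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)


def e_step(data, means, num_samples, rng):
    # For each observation, draw `num_samples` candidate latents uniformly
    # at random and keep the one under which the observation is most likely.
    latents = []
    for x in data:
        candidates = [rng.randrange(len(means)) for _ in range(num_samples)]
        latents.append(max(candidates, key=lambda z: log_lik(x, means[z])))
    return latents


def m_step(data, latents, means):
    # Refit each component mean on the observations assigned to it;
    # a component with no assigned observations keeps its old mean.
    new_means = list(means)
    for z in range(len(means)):
        assigned = [x for x, zi in zip(data, latents) if zi == z]
        if assigned:
            new_means[z] = sum(assigned) / len(assigned)
    return new_means


def bootstrap(data, means, iterations=3, num_samples=4, seed=0):
    # Alternate E-steps and M-steps, starting from the initial `means`.
    rng = random.Random(seed)
    for _ in range(iterations):
        latents = e_step(data, means, num_samples, rng)
        means = m_step(data, latents, means)
    return means


if __name__ == "__main__":
    toy_data = [-2.1, -1.9, -2.0, 2.0, 2.1, 1.9]
    print(bootstrap(toy_data, means=[-0.5, 0.5], num_samples=16))
```

In this toy setting, raising `num_samples` makes it more likely that a good latent is among the candidates for each observation, which loosely mirrors the abstract's point that additional E-step compute can improve the data used in the next round of training.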
