LG · Jun 24, 2025

Universal pre-training by iterated random computation

arXiv:2506.20057v17.11 citations · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses data scarcity for machine learning practitioners by providing a universal pre-training method, though it builds incrementally on prior theoretical work.

The paper pre-trains models on randomly generated synthetic data instead of real data. It shows that this approach enables zero-shot in-context learning across datasets, that performance improves with scale, and that finetuning after pre-training offers faster convergence and better generalization.

We investigate the use of randomly generated data for the sake of pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research that shows that sequence models can be trained to approximate Solomonoff induction. We derive similar, but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before the data is seen. We replicate earlier results that models trained this way show zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend earlier results to real-world data, and show that finetuning a model after pre-training offers faster convergence and better generalization.
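
As a rough illustration of what "iterated random computation" could look like, the toy sketch below starts from uniform random token sequences and passes them repeatedly through freshly sampled random computations. The abstract does not describe the paper's actual generator, so everything here is an assumption made for illustration: the random finite-state transducer stands in for whatever computation the authors use, and the names `random_transducer`, `apply_transducer`, and `synthetic_batch` are hypothetical.

```python
import random


def random_transducer(n_states, vocab, rng):
    """Sample a random finite-state transducer (an assumed stand-in for a
    'random computation'): maps (state, token) -> (next_state, output_token)."""
    return {
        (s, t): (rng.randrange(n_states), rng.randrange(vocab))
        for s in range(n_states)
        for t in range(vocab)
    }


def apply_transducer(table, seq):
    """Run one sequence through the transducer, starting in state 0."""
    state, out = 0, []
    for tok in seq:
        state, y = table[(state, tok)]
        out.append(y)
    return out


def synthetic_batch(n_seqs=4, length=16, vocab=8, depth=3, n_states=4, seed=0):
    """Generate synthetic token sequences: uniform noise, then `depth`
    rounds of a newly sampled random transducer applied to the whole batch."""
    rng = random.Random(seed)
    batch = [[rng.randrange(vocab) for _ in range(length)] for _ in range(n_seqs)]
    for _ in range(depth):
        table = random_transducer(n_states, vocab, rng)
        batch = [apply_transducer(table, seq) for seq in batch]
    return batch


if __name__ == "__main__":
    for seq in synthetic_batch():
        print(seq)
```

Sequences produced this way could then serve as next-token-prediction training data for a sequence model before any real data is seen; how closely this matches the paper's procedure is not something the abstract settles.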
