Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni

arXiv:2603.01293v13.82 citations | h-index: 53

Originality: Incremental advance

AI Analysis

This work provides theoretical insights into data quality and synergistic effects across the pre- and post-training of reasoning models. It is an incremental advance toward optimizing training strategies for large language models.

The authors ask why pretraining and reinforcement learning (RL) require large datasets while supervised fine-tuning (SFT) excels on smaller ones, and what defines high-quality data for fine-tuning. They theoretically analyze transformers on a linear regression task and validate the analysis with experiments on large nonlinear transformers. They find that balanced pretraining induces latent capabilities, that SFT learns best from small, challenging datasets, and that RL benefits from large-scale data that is not overly difficult.

Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: $(i)$ balanced pretraining data can induce latent capabilities later activated during post-training, and $(ii)$ SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
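To make the in-context weight prediction task for linear regression concrete, the following is a minimal sketch under stated assumptions, not the authors' setup. The Gaussian sampling, the prompt size, the dimension, and the names `make_prompt` and `least_squares_weights` are all hypothetical choices for illustration. The sketch builds one prompt of (x, y) pairs that share a hidden weight vector, then recovers that vector with ordinary least squares, a standard reference predictor that stands in here for a trained transformer.

```python
import numpy as np

def make_prompt(rng, n_examples=32, dim=4, noise_std=0.0):
    """Sample one prompt: (x, y) pairs that all share a hidden weight vector w.

    Assumption: Gaussian inputs and weights; the paper's distributions may differ.
    """
    w = rng.standard_normal(dim)
    X = rng.standard_normal((n_examples, dim))
    y = X @ w + noise_std * rng.standard_normal(n_examples)
    return X, y, w

def least_squares_weights(X, y):
    """Reference predictor: estimate w from the context by ordinary least squares."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

rng = np.random.default_rng(0)
X, y, w_true = make_prompt(rng)
w_hat = least_squares_weights(X, y)

# Squared error between the predicted and the hidden weights.
error = float(np.sum((w_hat - w_true) ** 2))
```

In this noiseless setting with more examples than dimensions, least squares recovers the hidden weights up to floating-point error. A model trained on this task would be scored by the same squared error between its predicted weights and `w_true`.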
