CL | Feb 11

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

arXiv:2602.11149v1 | 3 citations | h-index: 26
Originality: Incremental advance
AI Analysis

The result offers a practical, cost-effective way to improve reasoning in language models, though the advance is incremental because it builds on existing fine-tuning methods.

The paper studies supervised fine-tuning for reasoning language models and shows that, under a fixed update budget, repeating a small dataset for many epochs outperforms training on a larger dataset for fewer epochs. Olmo3-7B gains 12-26 percentage points on AIME'24/25 and GPQA.

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
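The fixed-budget comparison and the stopping rule described in the abstract can be illustrated with a minimal sketch. It assumes the update budget is counted as epochs times dataset size (that is, a constant batch size), which matches the abstract's pairing of 128 epochs on 400 samples with 1 epoch on 51200 samples. The function names, the 0.99 saturation threshold, and the accuracy trace are hypothetical and chosen only for illustration; the paper's own criterion and measurements may differ.

```python
from typing import Iterable, Optional


def epochs_for_budget(total_sample_updates: int, dataset_size: int) -> int:
    """Epochs needed so that epochs * dataset_size equals a fixed budget.

    Assumption: the budget is counted in samples seen, which corresponds to a
    fixed number of gradient updates when the batch size is held constant.
    """
    if dataset_size <= 0 or total_sample_updates % dataset_size != 0:
        raise ValueError("dataset_size must be positive and divide the budget")
    return total_sample_updates // dataset_size


def first_saturated_epoch(
    token_accuracy_per_epoch: Iterable[float],
    threshold: float = 0.99,  # hypothetical cutoff standing in for "full memorization"
) -> Optional[int]:
    """Return the 1-based epoch where training token accuracy first reaches
    the threshold, or None if it never does."""
    for epoch, accuracy in enumerate(token_accuracy_per_epoch, start=1):
        if accuracy >= threshold:
            return epoch
    return None


BUDGET = 51_200  # samples seen in both configurations from the abstract

for size in (400, 51_200):
    print(f"{size} samples -> {epochs_for_budget(BUDGET, size)} epoch(s)")

# Illustrative accuracy trace, not data from the paper.
trace = [0.62, 0.81, 0.93, 0.995, 1.0]
print("stop after epoch:", first_saturated_epoch(trace))
```

In practice the accuracy trace would come from the training loop's own logging, and training would stop once the returned epoch is reached rather than spending the remaining budget.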
