LG · AI · CL · Sep 29, 2025

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

arXiv:2509.25380v1 · h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of designing effective data curriculums for LLM training by offering a method to predict and optimize data placement. The advance is incremental because it builds on existing curriculum learning concepts.

The paper tackles the problem of optimizing data placement in LLM training by introducing the training re-evaluation curve (TREC), which measures how well the final model retains training data as a function of when that data was encountered. It shows that aligning high-quality data with TREC minima improves performance, including in continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.

Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be *predicted in advance* from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
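The abstract's reference to AdamW's implicit EMA coefficients can be illustrated with a minimal sketch. It assumes the standard decoupled update, θ_t = (1 − η_t·λ)·θ_{t−1} − η_t·u_t, under which the final weights are an exponential moving average of per-step updates with coefficient α_t = η_t·λ. The function name `ema_contributions` and the example schedule are hypothetical, and the sketch only computes how much each step contributes to the final weights; how the paper maps these coefficients onto a predicted TREC is not stated on this page.

```python
def ema_contributions(learning_rates, weight_decay):
    """Return (per-step weights, weight of the initial parameters).

    Assumes AdamW's decoupled weight decay, so step t has EMA coefficient
    alpha_t = learning_rates[t] * weight_decay, and its update survives in
    the final parameters with weight alpha_t * prod_{s > t} (1 - alpha_s).
    """
    contributions = [0.0] * len(learning_rates)
    remaining = 1.0  # running product of (1 - alpha_s) over later steps s > t
    for t in range(len(learning_rates) - 1, -1, -1):
        alpha = learning_rates[t] * weight_decay
        contributions[t] = alpha * remaining
        remaining *= 1.0 - alpha
    # After the loop, `remaining` is the weight left on the initial parameters.
    return contributions, remaining


if __name__ == "__main__":
    # Hypothetical schedule: linear decay from 1e-3 to 1e-4 over 1000 steps.
    steps = 1000
    schedule = [1e-3 - (9e-4 * t / (steps - 1)) for t in range(steps)]
    weights, init_weight = ema_contributions(schedule, weight_decay=0.1)
    peak_step = max(range(steps), key=lambda t: weights[t])
    print(f"step with the largest contribution to final weights: {peak_step}")
    print(f"weight remaining on the initial parameters: {init_weight:.4f}")
```

The per-step weights and the initial-parameter weight always sum to 1, which makes the output easy to sanity-check for any schedule. Under a constant learning rate the weights grow monotonically toward the end of training; a decaying schedule can shift the peak earlier.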

