CL, LG | Aug 21, 2025

Influence-driven Curriculum Learning for Pre-training on Limited Data

UW
arXiv:2508.15475v2 | 1 citation | h-index: 14 | Proceedings of the First BabyLM Workshop
Originality: Incremental advance
AI Analysis

This work addresses the problem of improving pre-training efficiency for language models on limited data, representing an incremental advance in curriculum learning techniques.

The paper tackles the limited success of curriculum learning in language model pre-training by replacing human-centered difficulty metrics with training data influence scores. Models trained on the resulting curricula outperform random-order training by over 10 percentage points on benchmarks.

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
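To make the idea of sorting by influence concrete, here is a minimal sketch. The abstract does not say which influence estimator or which sort direction the paper uses, so both are assumptions here. The sketch scores each training example by the dot product of its loss gradient with the mean gradient over a small reference set (a common first-order stand-in for influence), using a toy logistic-regression model in place of a language model. The names `grad_logistic`, `influence_scores`, and `curriculum_order` are hypothetical and do not come from the paper.

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for a single example (x, y), with y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

def influence_scores(w, X_train, y_train, X_ref, y_ref):
    """Assumed estimator: dot product between each training-example gradient
    and the mean gradient over a reference set (not the paper's exact method)."""
    ref_grad = np.mean(
        [grad_logistic(w, x, y) for x, y in zip(X_ref, y_ref)], axis=0
    )
    return np.array(
        [grad_logistic(w, x, y) @ ref_grad for x, y in zip(X_train, y_train)]
    )

def curriculum_order(scores, descending=False):
    """Return training-example indices sorted by score.
    The direction (low-to-high or high-to-low) is a free choice in this sketch."""
    order = np.argsort(scores, kind="stable")
    return order[::-1] if descending else order
```

A training loop would then iterate over `X_train[curriculum_order(scores)]` instead of a random permutation. The random-order baseline mentioned in the abstract corresponds to replacing that index array with `np.random.permutation(len(scores))`.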

