CL · Aug 27, 2025

Beyond Shallow Heuristics: Leveraging Human Intuition for Curriculum Learning

Vanessa Toborek, Sebastian Müller, Tim Selbach, Tamás Horváth, Christian Bauckhage

arXiv:2508.19873v1 · 2 citations · h-index: 4 · ICNLSP

Originality: Incremental advance

AI Analysis

This work addresses curriculum learning for language model pre-training and offers a practical method based on human intuition. The advance is incremental because it builds on existing curriculum learning frameworks.

The paper addresses the problem of defining linguistic difficulty for curriculum learning in language models by using human-curated simple language from Simple Wikipedia as a difficulty signal. Structuring this data via a curriculum, especially one that introduces the simple data first, consistently improved perplexity, particularly on simple language, whereas curricula based on shallow heuristics showed no consistent gains over random ordering.

Curriculum learning (CL) aims to improve training by presenting data from "easy" to "hard", yet defining and measuring linguistic difficulty remains an open challenge. We investigate whether human-curated simple language can serve as an effective signal for CL. Using the article-level labels from the Simple Wikipedia corpus, we compare label-based curricula to competence-based strategies relying on shallow heuristics. Our experiments with a BERT-tiny model show that adding simple data alone yields no clear benefit. However, structuring it via a curriculum -- especially when introduced first -- consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula lead to no consistent gains over random ordering, probably because they fail to effectively separate the two classes. Our results suggest that human intuition about linguistic difficulty can guide CL for language model pre-training.
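
The contrast between the orderings described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `"simple"`/`"normal"` label strings, and the word-count score are all assumptions made for illustration, and the abstract does not say which shallow heuristics the paper actually uses.

```python
import random

def label_curriculum(examples, simple_first=True, seed=0):
    """Order (text, label) pairs so one label group fully precedes the other.

    Labels are assumed to be the strings "simple" and "normal" (hypothetical
    names standing in for the article-level Simple Wikipedia labels).
    """
    rng = random.Random(seed)
    simple = [ex for ex in examples if ex[1] == "simple"]
    normal = [ex for ex in examples if ex[1] == "normal"]
    # Shuffle inside each group so only the group order carries the curriculum.
    rng.shuffle(simple)
    rng.shuffle(normal)
    return simple + normal if simple_first else normal + simple

def random_baseline(examples, seed=0):
    """Random ordering, the baseline the curricula are compared against."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled

def heuristic_curriculum(examples, score=lambda text: len(text.split())):
    """Sort by a shallow difficulty score; word count is an assumed stand-in."""
    return sorted(examples, key=lambda ex: score(ex[0]))

examples = [
    ("a b c", "normal"),
    ("a", "simple"),
    ("a b c d", "normal"),
    ("a b", "simple"),
]
simple_first_order = label_curriculum(examples)
heuristic_order = heuristic_curriculum(examples)
```

In this toy data the word-count score happens to separate the two labels perfectly. The abstract suggests that shallow heuristics probably fail to separate the two classes on the real corpus, which it offers as the likely reason the competence-based curricula show no consistent gains.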
