Irreducible Curriculum for Language Model Pretraining
This work addresses efficient, fine-grained data selection for language model training. It offers a novel approach that reduces computational cost while improving performance, though its contribution is incremental relative to existing curriculum learning methods.
The paper tackles automatic data selection and curriculum design for large language model pretraining. It proposes an irreducible curriculum algorithm that prioritizes samples with higher learnability and uses a small-scale proxy model to avoid prohibitive extra computation. The method shows consistent improvements in validation perplexity across domains and better 5-shot accuracy on MMLU than the baselines.
Automatic data selection and curriculum design for training large language models is challenging, and only a few existing methods show improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contribution of each individual training point. Traditional datapoint selection methods are difficult to apply to large language models: most online batch selection methods perform the forward or backward pass twice, which introduces considerable extra cost for large-scale models. To mitigate these obstacles, we propose irreducible curriculum, a curriculum learning algorithm for language model pretraining that prioritizes samples with higher learnability. Specifically, to avoid prohibitive extra computation, we simulate the sample loss along the main model's training trajectory using a small-scale proxy model. Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement in validation perplexity across all 7 domains compared to the random uniform baseline and the anti-curriculum strategy. Our method also reduces the sharpness of the network and achieves better 5-shot accuracy on the MMLU benchmark.
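As a rough illustration of ranking samples by learnability with a proxy model, the sketch below orders samples by the drop in per-sample loss between an early and a late proxy checkpoint. The abstract does not give the scoring formula, so this loss-gap definition is an assumption borrowed from reducible-loss style selection, and the function names and loss values are hypothetical.

```python
from typing import List, Sequence


def learnability_scores(early_losses: Sequence[float],
                        late_losses: Sequence[float]) -> List[float]:
    """Assumed score: how much the proxy model's loss on each sample
    drops between an early and a late checkpoint (larger = more learnable)."""
    if len(early_losses) != len(late_losses):
        raise ValueError("loss sequences must have the same length")
    return [early - late for early, late in zip(early_losses, late_losses)]


def curriculum_order(early_losses: Sequence[float],
                     late_losses: Sequence[float]) -> List[int]:
    """Return sample indices sorted from most to least learnable."""
    scores = learnability_scores(early_losses, late_losses)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical per-sample losses recorded once from a small proxy model,
# so the large main model needs no extra forward or backward passes.
early = [2.0, 3.5, 1.0, 4.0]
late = [1.5, 1.0, 1.0, 3.75]

order = curriculum_order(early, late)
anti_order = list(reversed(order))  # anti-curriculum: least learnable first
print(order)       # [1, 0, 3, 2]
print(anti_order)  # [2, 3, 0, 1]
```

In this toy example, sample 2 has a low loss that never improves, and sample 3 has a high loss that barely improves, so both are ranked last. Sample 1, whose loss drops the most, is ranked first.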