MathPile: A Billion-Token-Scale Pretraining Corpus for Math
MathPile is a domain-specific resource for researchers and developers aiming to improve mathematical reasoning in language models. Its contribution is incremental, since it focuses on data curation rather than novel methods.
The authors tackled the problem of limited high-quality math data for pretraining language models by introducing MathPile, a 9.5-billion-token corpus, which boosted performance on mathematical reasoning benchmarks through continual pretraining.
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates and conducted continual pre-training experiments, boosting performance on common mathematical reasoning benchmarks. We hope MathPile will boost language models' mathematical reasoning abilities, and we open-source its different versions and processing scripts to advance the field.
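The deduplication and contamination-detection steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which algorithms or thresholds were used, so everything here is an assumption chosen for illustration: exact-match deduplication on normalized text, and contamination flagged by a shared word 8-gram with a benchmark item. The function names (`normalize`, `deduplicate`, `word_ngrams`, `is_contaminated`) and the default `n=8` are hypothetical and are not taken from the MathPile processing scripts.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization.

    Assumption: exact-match hashing only; near-duplicate detection is not covered.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a document that shares at least one word n-gram with a benchmark item.

    Assumption: n=8 is an illustrative window size, not a value from the paper.
    """
    benchmark_grams: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        benchmark_grams |= word_ngrams(item, n)
    return not word_ngrams(doc, n).isdisjoint(benchmark_grams)


if __name__ == "__main__":
    corpus = ["Let x = 1.", "let  x = 1.", "Prove that 2 + 2 = 4."]
    print(deduplicate(corpus))

    benchmark = ["what is the sum of the first ten positive integers"]
    leaked = "Exercise: what is the sum of the first ten positive integers? Answer: 55"
    print(is_contaminated(leaked, benchmark))
```

In a pipeline of this shape, documents flagged by `is_contaminated` would be dropped before continual pre-training, so that benchmark scores are not inflated by test items seen during training.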