DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
DeepMath-103K addresses the shortage of training data for advancing reasoning in AI, a problem that chiefly concerns machine learning researchers and practitioners. The contribution is incremental in that it builds on existing dataset creation efforts.
The authors tackle the lack of large-scale, challenging, contamination-free, and verifiable mathematical training data for reinforcement learning with large language models by introducing the DeepMath-103K dataset. Models trained on it achieve state-of-the-art results on challenging mathematical benchmarks and generalize to domains such as biology, physics, and chemistry.
Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free, and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL rewards. It further includes three distinct R1 solutions that can be adapted to diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable, advanced reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and generalize beyond math to domains such as biology, physics, and chemistry, underscoring the dataset's broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
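To make "verifiable answers for rule-based RL rewards" concrete, the following is a minimal sketch of what such a reward could look like. It is not the authors' verifier: the `\boxed{...}` answer convention, the `normalize` helper, and the function names are all assumptions introduced here for illustration. A practical verifier would likely also need to check mathematical equivalence (for example, `0.5` versus `\frac{1}{2}`), which this sketch does not attempt.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never closed


def normalize(answer: str) -> str:
    # Hypothetical normalization: drop spaces, dollar signs, and a trailing period.
    return answer.replace(" ", "").replace("$", "").rstrip(".")


def rule_based_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the final boxed answer matches the reference string."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0  # no final answer found, so nothing to verify
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


if __name__ == "__main__":
    response = "Integrating gives the area, so the answer is \\boxed{\\frac{1}{2}}."
    print(rule_based_reward(response, "\\frac{1}{2}"))
    print(rule_based_reward(response, "2"))
```

The point of the sketch is that a reward of this kind needs no learned judge: because each problem carries a single checkable final answer, correctness can be decided by a deterministic rule.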