CL · Feb 18, 2025

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, Xian Li

arXiv:2502.13124v434.265 citations · h-index: 21 · Has Code

Originality: Synthesis-oriented

AI Analysis

NaturalReasoning provides a scalable resource for advancing AI reasoning across diverse domains, though the contribution is incremental: it builds on existing dataset-creation and knowledge-distillation methods.

The authors address the lack of diverse, high-quality reasoning questions beyond traditional domains such as math and coding by introducing NaturalReasoning, a dataset of 2.8 million challenging questions across multiple fields. Knowledge distillation experiments show that these questions can elicit and transfer reasoning capabilities from a teacher model.

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding. To foster future work, we publicly release NaturalReasoning at https://huggingface.co/datasets/facebook/natural_reasoning.
