LG | Jun 4, 2025

OpenThoughts: Data Recipes for Reasoning Models

CMU
arXiv:2506.04178v2 | 166 citations | h-index: 48 | Has Code
Originality: Incremental advance
AI Analysis

The work addresses the need for accessible training data for reasoning models, which enables broader research and development. It is an incremental advance because it builds on existing data generation methods.

The paper tackles the shortage of public data for training reasoning models by creating open-source datasets. The resulting OpenThoughts3-7B model reaches state-of-the-art results, including 53% on AIME 2025, with gains of up to 20.5 percentage points (on GPQA Diamond) over DeepSeek-R1-Distill-Qwen-7B.

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. These are improvements of 15.3, 17.2, and 20.5 percentage points, respectively, over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available at https://openthoughts.ai.
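
The gains in the abstract are stated in percentage points, which are absolute differences between scores and not relative improvements. The following is a minimal sketch of that arithmetic in Python. The helper names `implied_baseline` and `relative_gain` are hypothetical, and the baseline values it derives are only approximations implied by the rounded figures above, not numbers stated in the abstract.

```python
# Figures reported in the abstract: (score in %, gain in percentage points).
reported = {
    "AIME 2025": (53.0, 15.3),
    "LiveCodeBench 06/24-01/25": (51.0, 17.2),
    "GPQA Diamond": (54.0, 20.5),
}


def implied_baseline(score: float, gain_pp: float) -> float:
    """Approximate baseline score implied by a score and a percentage-point gain.

    Assumption: the reported scores are rounded, so this is only an estimate.
    """
    return round(score - gain_pp, 1)


def relative_gain(score: float, gain_pp: float) -> float:
    """Relative improvement (in %) over the implied baseline.

    This is a different quantity from the percentage-point gain.
    """
    baseline = score - gain_pp
    return round(100.0 * gain_pp / baseline, 1)


if __name__ == "__main__":
    for benchmark, (score, gain_pp) in reported.items():
        print(
            f"{benchmark}: score {score}%, gain {gain_pp} pp, "
            f"implied baseline ~{implied_baseline(score, gain_pp)}%, "
            f"relative gain ~{relative_gain(score, gain_pp)}%"
        )
```

Because the baseline scores are well below 50%, the relative gains computed this way are much larger than the percentage-point gains, which is why the two should not be conflated when comparing models.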

Code Implementations: 1 repo
