Jun 11, 2025

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

arXiv:2506.09513v3 · 22 citations · h-index: 37 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

ReasonMed provides a high-quality dataset for advancing medical reasoning in AI and addresses a gap in clinical validation, though the contribution is incremental because it builds on existing reasoning methods.

The authors tackled the lack of large-scale medical reasoning datasets by introducing ReasonMed, a 370k-example dataset generated via multi-agent processes, which enabled training models that set new benchmarks, such as ReasonMed-7B surpassing prior sub-10B models by 4.17% and exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
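The generate, verify, and refine flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tier thresholds, the `ReasoningPath` type, and the `verify` and `refine` callables are all hypothetical stand-ins, and the abstract does not say how the easy-medium-difficult (EMD) tiers are actually decided. Here a question is tiered by how many of its candidate paths the verifier accepts, and the refiner is applied only to flagged paths in the medium tier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReasoningPath:
    """One candidate chain-of-thought for a question (hypothetical type)."""
    question: str
    steps: List[str]
    answer: str


# Hypothetical thresholds: the abstract does not give the real EMD criteria.
EASY_MIN_CLEAN = 5
MEDIUM_MIN_CLEAN = 3

# A verifier returns the indices of error-prone steps (empty list = accepted).
Verifier = Callable[[ReasoningPath], List[int]]
# A refiner rewrites the flagged steps and returns a corrected path.
Refiner = Callable[[ReasoningPath, List[int]], ReasoningPath]


def tier_for(clean_count: int) -> str:
    """Map the number of verifier-accepted paths to a difficulty tier."""
    if clean_count >= EASY_MIN_CLEAN:
        return "easy"
    if clean_count >= MEDIUM_MIN_CLEAN:
        return "medium"
    return "difficult"


def curate(
    paths: List[ReasoningPath], verify: Verifier, refine: Refiner
) -> Tuple[str, List[ReasoningPath]]:
    """Tier one question's candidate paths and return the paths to keep."""
    reports = [(path, verify(path)) for path in paths]
    clean = [path for path, flagged in reports if not flagged]
    tier = tier_for(len(clean))
    if tier == "easy":
        # Enough accepted paths: keep them unchanged.
        return tier, clean
    if tier == "medium":
        # Repair only the paths the verifier flagged.
        repaired = [refine(path, flagged) for path, flagged in reports if flagged]
        return tier, clean + repaired
    # Difficult questions would need a separate, stronger generation pass,
    # which this sketch does not model.
    return tier, []
```

In this sketch the refiner runs only on the medium tier, so the extra model calls are spent on paths that are close to acceptable, which is one plausible reading of "cost-efficient" in the abstract.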

Code Implementations: 1 repo