CL | Jan 21, 2025

MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang

arXiv:2501.12051v3 | 20.416 citations | h-index: 18 | Has Code

Originality: Highly original

AI Analysis

This work addresses the need for versatile, credible, and efficient language models for clinical reasoning. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles the lack of robust clinical reasoning in medical language models by proposing a self-evolving framework that uses Monte Carlo Tree Search (MCTS) and fine-grained supervision of intermediate reasoning steps. Across eleven benchmarks, the resulting model outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points.

Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, so they remain far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS$^3$, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories, ranked by node values, are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS$^3$ outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS$^3$ achieves robust and faithful reasoning behavior.
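The abstract does not give the exact form of the soft dual-sided labels, so the following is only a minimal sketch of the stated idea that a step which degrades node value is penalized. The function name `soft_step_labels`, the `penalty` weight, and the clamping to [0, 1] are illustrative assumptions, not the authors' formulation.

```python
from typing import List


def soft_step_labels(
    root_value: float,
    step_values: List[float],
    penalty: float = 0.5,  # hypothetical weight, not taken from the paper
) -> List[float]:
    """Turn MCTS node values along one trajectory into soft per-step targets.

    step_values[i] is the estimated value of the node reached after step i.
    A step that raises or keeps the value is labeled with the node value itself.
    A step that lowers the value is additionally penalized in proportion to the drop.
    """
    labels = []
    prev = root_value
    for value in step_values:
        delta = value - prev  # value dynamics: change caused by this step
        label = value + penalty * delta if delta < 0 else value
        labels.append(min(1.0, max(0.0, label)))  # keep targets in [0, 1]
        prev = value
    return labels


if __name__ == "__main__":
    # The second step lowers the value, so it receives a reduced label
    # even though the trajectory ends at a high-value node.
    print(soft_step_labels(0.5, [0.6, 0.4, 0.9]))
```

Under these assumptions, a trajectory that reaches a correct answer can still contain a low-labeled step, which is the kind of intermediate error the process reward model is meant to surface.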
