CL | Jul 10, 2025

Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung

arXiv:2507.07988v118.221 citations | h-index: 12 | npj Digital Medicine

Originality: Incremental advance

AI Analysis

This work provides a scalable and rigorous tool for assessing LLMs in clinical decision-making, addressing a critical gap for safe deployment in healthcare.

The authors address the evaluation of large language models' medical reasoning by introducing MedThink-Bench, a benchmark of 500 questions across ten medical domains with expert-crafted step-by-step rationales, and LLM-w-Ref, an evaluation framework whose assessments correlate strongly with expert judgments. Benchmarking twelve LLMs, they find that some smaller models (e.g., MedGemma-27B) can surpass larger proprietary ones (e.g., OpenAI-o3).

As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability suffer from either unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
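
The abstract does not spell out how LLM-w-Ref turns judge decisions into a score, so the following is only a minimal sketch of one plausible reading: a judge checks each expert reference step against the model's rationale, and the score is the fraction of steps it deems supported. The names `step_coverage`, `Judge`, and `keyword_judge` are hypothetical, and the keyword matcher is a toy stand-in for an actual LLM-as-a-Judge call, not the authors' method.

```python
from typing import Callable, List

# Hypothetical judge signature (an assumption, not from the paper):
# (reference_step, model_rationale) -> True if the rationale supports the step.
Judge = Callable[[str, str], bool]


def step_coverage(reference_steps: List[str], model_rationale: str, judge: Judge) -> float:
    """Return the fraction of expert reference steps the judge deems supported."""
    if not reference_steps:
        raise ValueError("reference_steps must not be empty")
    supported = sum(1 for step in reference_steps if judge(step, model_rationale))
    return supported / len(reference_steps)


def keyword_judge(step: str, rationale: str) -> bool:
    """Toy stand-in for an LLM judge: every word of the step occurs in the rationale."""
    rationale_words = set(rationale.lower().split())
    return all(word in rationale_words for word in step.lower().split())


# Illustrative, made-up inputs (not taken from MedThink-Bench).
reference = ["fever present", "elevated white count", "order blood cultures"]
rationale = "The patient has fever present and an elevated white count so infection is likely"

score = step_coverage(reference, rationale, keyword_judge)
print(round(score, 3))  # 0.667: two of the three reference steps are matched
```

In practice the `judge` argument would wrap a call to a judging LLM; keeping it as a plain callable lets the scoring logic be tested without any model access.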
