AI, CL, CV, LG · Jan 30, 2025

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Tsinghua
arXiv:2501.18362v3 · 169 citations · h-index: 35 · Has Code · ICML
Originality: Incremental advance
AI Analysis

This benchmark addresses the need for more rigorous evaluation of expert-level medical knowledge and reasoning in AI models, though it is incremental as it builds upon existing medical QA benchmarks.

The authors introduced MedXpertQA, a challenging benchmark with 4,460 expert-level medical questions across 17 specialties and 11 body systems, including multimodal and reasoning subsets, and evaluated 18 leading models to assess advanced medical reasoning.

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA
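
As a rough illustration of how a model might be scored separately on the Text and MM subsets, here is a minimal sketch of per-subset accuracy for multiple-choice answers. The record fields (`id`, `subset`, `label`), the function name, and the toy data are all hypothetical and chosen for illustration; they are not taken from the released dataset or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_subset(records, predictions):
    """Return {subset: accuracy} for multiple-choice predictions.

    records: iterable of dicts with hypothetical keys "id", "subset", "label".
    predictions: dict mapping question id to the predicted option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["subset"]] += 1
        # A missing prediction is counted as wrong.
        if predictions.get(rec["id"]) == rec["label"]:
            correct[rec["subset"]] += 1
    return {subset: correct[subset] / total[subset] for subset in total}


# Toy, made-up records for illustration only (not real benchmark items).
toy_records = [
    {"id": "q1", "subset": "Text", "label": "A"},
    {"id": "q2", "subset": "Text", "label": "C"},
    {"id": "q3", "subset": "MM", "label": "B"},
]
toy_predictions = {"q1": "A", "q2": "D", "q3": "B"}
scores = accuracy_by_subset(toy_records, toy_predictions)
```

Reporting the two subsets separately, rather than as one pooled number, keeps text-only and multimodal performance distinguishable.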

