CL · AI · May 2

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

arXiv:2605.01417 · 94.8 · h-index: 22 · Has Code
Predicted impact: top 13% in CL · last 90 days · Originality: Incremental advance
AI Analysis

Provides a comprehensive, open-source evaluation suite to address benchmark saturation and data accessibility issues for medical LLM evaluation.

Medmarks introduces a fully open-source benchmark suite with 30 tasks for evaluating LLMs in medicine, testing 61 models across 71 configurations. Frontier reasoning models (e.g., Gemini 3 Pro Preview, GPT-5.1) achieve top performance; most frontier proprietary models are more token-efficient than open-weight alternatives, and medically fine-tuned models outperform their generalist counterparts.

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have saturated, depend heavily on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, that most frontier proprietary models are significantly more token-efficient than open-weight alternatives, that medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be used directly as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks

