CL, AI | Oct 31, 2025

MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

arXiv:2510.27267v1 | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in evaluating LLMs for clinical decision-making, though the advance is incremental because it builds on existing frameworks such as InternBootcamp.

The authors address the lack of benchmarks for medical calculation in large language models with two contributions: MedCalc-Eval, a dataset of 700+ calculation tasks, and MedCalc-Env, a reinforcement learning environment. Fine-tuning Qwen2.5-32B in that environment achieves state-of-the-art results on MedCalc-Eval.

As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
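To make the two task types concrete, here is a minimal Python sketch of two equation-based calculators (BMI and Cockcroft-Gault) and one rule-based score (Glasgow Coma Scale), using the standard published formulas. The function names, argument names, and the `is_correct` helper with its 5% relative tolerance are illustrative assumptions and are not taken from the MedCalc-Eval code or its grading rules.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Equation-based: body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Equation-based: estimated creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The standard formula multiplies by 0.85 for female patients.
    return crcl * 0.85 if female else crcl


def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based: sum of three component scores, total range 3 to 15."""
    limits = {"eye": (eye, 4), "verbal": (verbal, 5), "motor": (motor, 6)}
    for name, (value, upper) in limits.items():
        if not 1 <= value <= upper:
            raise ValueError(f"{name} score must be between 1 and {upper}")
    return eye + verbal + motor


def is_correct(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Hypothetical grader: accept answers within a relative tolerance.

    The 5% default is an assumption for illustration only.
    """
    return abs(predicted - reference) <= rel_tol * abs(reference)
```

A grader of this shape could also act as a verifiable reward signal in a reinforcement learning setup, although how MedCalc-Env actually scores answers is not described in the abstract.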

