CL · Apr 5

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

arXiv:2605.2019724.3

AI Analysis

Provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, enabling evaluation of LLMs for clinical NLP tasks.

MedicalBench is a benchmark for implicit medical concept extraction with evidence grounding, built from MIMIC-IV and ICD-10 codes. State-of-the-art LLMs show modest performance, highlighting the difficulty of extracting implicitly expressed concepts.

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
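
The two evaluation tasks described above can be illustrated with a minimal sketch. The abstract does not specify the dataset schema or the metrics used, so everything here is an assumption made for illustration: the `Example` and `Prediction` field names are hypothetical, accuracy stands in for the concept-verification score, and set-based F1 over sentence indices stands in for the evidence-retrieval score.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One note-concept pair. Field names are hypothetical, not the dataset's schema."""
    note_sentences: list[str]   # discharge summary split into sentences
    concept: str                # candidate concept, e.g. an ICD-10 code description
    label: bool                 # gold answer: is the concept supported by the note?
    evidence: set[int] = field(default_factory=set)  # gold evidence sentence indices


@dataclass
class Prediction:
    """A model's output for one note-concept pair."""
    label: bool
    evidence: set[int] = field(default_factory=set)


def verification_accuracy(examples: list[Example], predictions: list[Prediction]) -> float:
    """Task 1 (assumed metric): fraction of pairs whose predicted label matches gold."""
    if not examples:
        return 0.0
    correct = sum(e.label == p.label for e, p in zip(examples, predictions))
    return correct / len(examples)


def evidence_f1(gold: set[int], predicted: set[int]) -> float:
    """Task 2 (assumed metric): F1 between gold and predicted evidence sentence indices."""
    if not gold and not predicted:
        return 1.0  # nothing to retrieve and nothing retrieved
    overlap = len(gold & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Scoring the label and the evidence separately mirrors the abstract's distinction between correctness and interpretability: a model can verify a concept correctly while pointing at the wrong sentences, and only the second score would reveal that.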
