LG AI · May 14

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok

arXiv:2605.14543 · 27.8

Predicted impact: top 10% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers evaluating LLMs in clinical decision-making, RxEval provides a more realistic and challenging benchmark that exposes limitations in current models' ability to handle time-ordered clinical trajectories and specific medication choices.

Existing medication recommendation benchmarks use coarse admission-level prediction, failing to capture real prescribing dynamics. RxEval introduces a prescription-level benchmark with 1,547 multiple-choice questions requiring selection of specific medication-dose-route triples; evaluation of 16 LLMs shows F1 scores from 45.18 to 77.10 and best Exact Match of 46.10%, revealing that even frontier models overlook patient information and fail to derive clinical conclusions.

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability through multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
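The abstract reports F1 and Exact Match but this listing does not define them. The sketch below shows one common way such scores are computed for multi-select questions, assuming each answer is a set of option labels, F1 is computed per question over the selected options and then averaged, and Exact Match requires the selected set to equal the reference set. The function names `question_scores` and `benchmark_scores` are hypothetical, and the paper's actual scoring protocol may differ.

```python
from typing import Iterable, List, Set, Tuple


def question_scores(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, bool]:
    """Return (set-based F1, exact match) for one multi-select question.

    Assumption: options are identified by labels such as "A", "B", "C".
    """
    pred: Set[str] = set(predicted)
    ref: Set[str] = set(gold)
    true_positives = len(pred & ref)

    if not pred and not ref:
        f1 = 1.0  # nothing to select and nothing selected
    elif true_positives == 0:
        f1 = 0.0  # no overlap between selection and reference
    else:
        precision = true_positives / len(pred)
        recall = true_positives / len(ref)
        f1 = 2 * precision * recall / (precision + recall)

    return f1, pred == ref


def benchmark_scores(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> Tuple[float, float]:
    """Average per-question F1 and Exact Match over a benchmark, as percentages.

    Assumption: macro-averaging over questions; micro-averaging is another option.
    """
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    results = [question_scores(pred, gold) for pred, gold in pairs]
    mean_f1 = 100.0 * sum(f1 for f1, _ in results) / len(results)
    exact_match = 100.0 * sum(1 for _, exact in results if exact) / len(results)
    return mean_f1, exact_match
```

Under these assumptions, a model that selects two of three correct options and no distractors earns partial F1 credit on that question but no Exact Match credit, which is consistent with the best Exact Match (46.10%) sitting well below the best F1 (77.10).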
