LG · AI · May 14

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

arXiv:2605.14543 · 27.8
Predicted impact: top 10% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For researchers evaluating LLMs in clinical decision-making, RxEval provides a more realistic and challenging benchmark that exposes limitations in current models' ability to handle time-ordered clinical trajectories and specific medication choices.

Existing medication recommendation benchmarks use coarse admission-level prediction, failing to capture real prescribing dynamics. RxEval introduces a prescription-level benchmark with 1,547 multiple-choice questions requiring selection of specific medication-dose-route triples; evaluation of 16 LLMs shows F1 scores from 45.18 to 77.10 and best Exact Match of 46.10%, revealing that even frontier models overlook patient information and fail to derive clinical conclusions.

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability through multiple-choice questions: each question presents a detailed patient profile and a time-ordered clinical trajectory, and requires selecting specific medication-dose-route triples from among real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
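
The abstract reports F1 and Exact Match but does not spell out how they are computed. The sketch below shows one plausible reading, assuming each question is scored by comparing the set of medication-dose-route triples a model selects against the set of real prescriptions. The `Triple` type, the `question_scores` function, and the placeholder drug names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Triple:
    """One answer option: a medication-dose-route triple (hypothetical type)."""
    medication: str
    dose: str
    route: str


def question_scores(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, bool]:
    """Return (set-based F1, exact match) for a single question.

    Assumption: F1 is computed over the selected options of one question;
    the paper may aggregate differently (e.g. micro-averaged over questions).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    # Exact match requires the selected set to equal the gold set exactly.
    return f1, predicted == gold


# Placeholder options; these are not real prescriptions from the benchmark.
option_a = Triple("drug_A", "10 mg", "oral")
option_b = Triple("drug_B", "5 mg", "intravenous")
option_c = Triple("drug_A", "20 mg", "oral")  # distractor: same drug, wrong dose

gold_set = {option_a, option_b}
partial_f1, partial_exact = question_scores({option_a, option_c}, gold_set)
full_f1, full_exact = question_scores({option_a, option_b}, gold_set)
```

Under this reading, a model that gets one of two triples right and adds one distractor earns partial F1 credit but no Exact Match credit, which is consistent with the reported gap between the best F1 (77.10) and the best Exact Match (46.10%).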

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
