Do Clinical Models Change Treatment Decisions?
For medical AI developers, this work identifies a critical gap in how clinical models are evaluated (treatment decision-making when patient context changes) and proposes a benchmark, ClinPivot, to address it.
ClinPivot reveals that strong medical QA performance does not reliably predict treatment decision-making; frontier models often fail to adjust decisions when patient context changes, and decision-structured supervision improves pivot-sensitive decision-making.
Clinical foundation models are typically evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change their treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves both pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.
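To make the idea of pivot sensitivity concrete, the sketch below shows one way such an evaluation could be scored. It is an illustration under stated assumptions, not ClinPivot's actual scoring code: the `PivotItem` structure, the `pivot_scores` function, and the metric names are all hypothetical, and the treatment labels are placeholders rather than real clinical data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PivotItem:
    """Hypothetical benchmark item: one patient seen before and after a context pivot."""
    base_answer: str   # reference treatment for the base context
    pivot_answer: str  # reference treatment once the new constraint is added
    model_base: str    # model's choice on the base context
    model_pivot: str   # model's choice on the pivoted context


def pivot_scores(items):
    """Score base accuracy, pivot accuracy, paired accuracy, and a 'stuck' rate."""
    if not items:
        raise ValueError("items must be non-empty")
    n = len(items)
    base_acc = sum(i.model_base == i.base_answer for i in items) / n
    pivot_acc = sum(i.model_pivot == i.pivot_answer for i in items) / n
    # Paired accuracy: the model must be right both before and after the pivot.
    pair_acc = sum(
        i.model_base == i.base_answer and i.model_pivot == i.pivot_answer
        for i in items
    ) / n
    # Stuck rate: among items whose reference treatment changes,
    # how often the model simply repeats its base choice.
    changed = [i for i in items if i.base_answer != i.pivot_answer]
    stuck = (
        sum(i.model_pivot == i.model_base for i in changed) / len(changed)
        if changed
        else 0.0
    )
    return {
        "base_accuracy": base_acc,
        "pivot_accuracy": pivot_acc,
        "paired_accuracy": pair_acc,
        "stuck_rate": stuck,
    }


# Placeholder labels only; no real treatments or results are implied.
example_items = [
    PivotItem("treatment_A", "treatment_B", "treatment_A", "treatment_B"),
    PivotItem("treatment_A", "treatment_B", "treatment_A", "treatment_A"),
    PivotItem("treatment_C", "treatment_C", "treatment_C", "treatment_C"),
    PivotItem("treatment_A", "treatment_D", "treatment_B", "treatment_D"),
]

if __name__ == "__main__":
    print(pivot_scores(example_items))
```

Separating paired accuracy from per-context accuracy matters here because a model can score well on each context in isolation while still failing to switch treatments for the same patient, which is the failure mode the abstract describes.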