CL · AI · May 14

Quantifying and Mitigating Premature Closure in Frontier LLMs

Rebecca Handler, Suhana Bedi, Nigam Shah

arXiv:2605.15000 · 25.1

Predicted impact: top 42% in CL · last 90 days

Originality: Incremental advance

AI Analysis

Identifies a critical safety gap in medical LLMs for healthcare practitioners, though the mitigation approach is incremental.

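The paper defines and measures premature closure in frontier LLMs. On medical QA questions with the correct choice removed, baseline false-action rates ranged from 53% to 82% across MedQA and AfriMed-QA. In open-ended evaluation, models gave inappropriate answers on an average of 30% of HealthBench questions and 78% of physician-authored adversarial queries. Safety-oriented prompting reduced but did not eliminate the issue.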

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
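The false-action rate described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Question` type, the `ABSTAIN` label, the re-lettering of the remaining options, and the rule that any selected listed option counts as a false action are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical label for any non-committal response (abstain, clarify,
# escalate, refuse). How responses are mapped to it is not specified here.
ABSTAIN = "ABSTAIN"


@dataclass(frozen=True)
class Question:
    stem: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def remove_correct_option(q: Question) -> dict[str, str]:
    """Drop the correct option and re-letter the rest (re-lettering is an assumption)."""
    remaining = [text for letter, text in q.options.items() if letter != q.answer]
    return {chr(ord("A") + i): text for i, text in enumerate(remaining)}


def false_action_rate(responses: list[str], option_sets: list[dict[str, str]]) -> float:
    """Fraction of ablated questions on which the model still picked a listed option."""
    if not responses or len(responses) != len(option_sets):
        raise ValueError("need one non-empty response list aligned with option_sets")
    committed = sum(1 for r, opts in zip(responses, option_sets) if r in opts)
    return committed / len(responses)
```

For example, if a model picks a remaining option on two of four ablated questions and abstains on the other two, `false_action_rate` returns 0.5. Every remaining option is wrong by construction, so any selection counts as a false action.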
