AI · Jun 3

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan Xue, Zhijun Yin

arXiv:2606.05463 · 71.1

Predicted impact: top 45% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For patient safety experts and AI developers, this benchmark provides a controlled, verifiable testbed to assess LLMs' policy reasoning and safe abstention in high-stakes medical triage.

The paper introduces PSEBench, a 5,074-case benchmark for evaluating LLMs in patient safety event triage, built using a policy-grounded methodology with clause cards. Evaluation of 15 LLMs reveals consistent capability trends and identifies actionable gaps toward reliable triage.

Patient safety event triage, the task of determining whether a clinical event is reportable under jurisdiction-specific policy, is high-stakes work typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks that capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation of 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
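
The abstract does not specify the clause card schema, so the following is only a minimal Python sketch of the general idea, under stated assumptions: a clause is reduced to a list of required elements, a case is a mapping from element to a fact, and the three variants (complete, missing-information, uncertain) get their labels by construction. The names `ClauseCard`, `triage`, `make_variants`, the label strings, and the example clause are all hypothetical and are not taken from the paper or from Minnesota's statute.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Hypothetical label set; the benchmark's actual labels may differ.
REPORTABLE = "reportable"
NOT_REPORTABLE = "not_reportable"
NEED_INFO = "need_info"   # report is incomplete, so ask for more information
ABSTAIN = "abstain"       # narrative is irreducibly ambiguous

UNCERTAIN = "uncertain"   # marker for a fact the narrative leaves ambiguous
Fact = Union[bool, str]


@dataclass(frozen=True)
class ClauseCard:
    """Assumed shape: one policy clause as a list of required elements."""
    clause_id: str
    required_elements: Tuple[str, ...]


def triage(card: ClauseCard, facts: Dict[str, Fact]) -> str:
    """Decide a label from the facts available for each required element."""
    values = [facts.get(name) for name in card.required_elements]
    if any(v is False for v in values):
        return NOT_REPORTABLE      # one failed element rules the clause out
    if any(v is None for v in values):
        return NEED_INFO           # an element is absent from the report
    if any(v == UNCERTAIN for v in values):
        return ABSTAIN             # an element cannot be resolved
    return REPORTABLE


def make_variants(complete: Dict[str, Fact], drop: str, blur: str) -> Dict[str, Dict[str, Fact]]:
    """Derive missing-information and uncertain variants from a complete case."""
    missing = {k: v for k, v in complete.items() if k != drop}
    uncertain = dict(complete, **{blur: UNCERTAIN})
    return {"complete": complete, "missing": missing, "uncertain": uncertain}


# Illustrative clause and case; element names are placeholders.
card = ClauseCard("EXAMPLE-1", ("occurred_in_facility", "serious_injury"))
complete = {"occurred_in_facility": True, "serious_injury": True}
variants = make_variants(complete, drop="serious_injury", blur="serious_injury")
labels = {name: triage(card, facts) for name, facts in variants.items()}
print(labels)
```

Because each variant is produced by a known edit to a complete case, its expected label follows from the edit itself rather than from post hoc annotation, which is the sense of "by-construction ground truth" that this sketch tries to illustrate.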
