AI · Jun 3

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

arXiv:2606.05463
AI Analysis

For patient safety experts and AI developers, this benchmark provides a controlled, verifiable testbed to assess LLMs' policy reasoning and safe abstention in high-stakes medical triage.

The paper introduces PSEBench, a 5,074-case benchmark for evaluating LLMs in patient safety event triage, built using a policy-grounded methodology with clause cards. Evaluation of 15 LLMs reveals consistent capability trends and identifies actionable gaps toward reliable triage.

Patient safety event triage, the task of determining whether a clinical event is reportable under jurisdiction-specific policy, is high-stakes work typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks that capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation of 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
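To make the clause-card idea more concrete, here is a minimal sketch of how a clause might be factorized into required elements and mapped to the three behaviors the abstract names: a decision, a request for missing information, and abstention. Everything in it is an assumption made for illustration: the `ClauseCard` fields, the `triage` function, the outcome labels, and the example clause are hypothetical and are not taken from the paper or from Minnesota's statute.

```python
from dataclasses import dataclass

# Hypothetical schema: field names are illustrative, not the paper's format.
@dataclass(frozen=True)
class ClauseCard:
    clause_id: str
    required_elements: tuple[str, ...]  # facts that must all hold for the clause to apply

UNKNOWN = None           # fact absent from the report, but obtainable by asking
AMBIGUOUS = "ambiguous"  # fact that stays unresolved even after follow-up

def triage(card: ClauseCard, facts: dict) -> str:
    """Return a decision for one clause given the facts extracted from a report."""
    values = [facts.get(name, UNKNOWN) for name in card.required_elements]
    # One element known to be false rules the clause out, whatever else is missing.
    if any(v is False for v in values):
        return "not_reportable"
    # An irreducibly ambiguous element means no safe decision exists.
    if any(v == AMBIGUOUS for v in values):
        return "abstain"
    # Otherwise, ask for any facts the report did not state.
    missing = [n for n, v in zip(card.required_elements, values) if v is UNKNOWN]
    if missing:
        return "ask:" + ",".join(missing)
    return "reportable"

# Illustrative clause only; not quoted from any regulation.
card = ClauseCard("demo-wrong-site", ("surgery_performed", "wrong_site"))
print(triage(card, {"surgery_performed": True, "wrong_site": True}))
print(triage(card, {"surgery_performed": True}))
```

Because each card lists its required elements explicitly, the expected answer for a generated narrative follows from which elements were set, removed, or marked ambiguous. That is one plausible reading of "by-construction ground truth", not a description of the authors' pipeline.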
