Che Ngufor

18.4 AI Jun 3

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

Keqi Han, Ryan Young, Annabel Strauss et al.

Patient safety event triage, which determines whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks that capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation of 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.

CV Mar 15, 2017

Transfer Learning for Melanoma Detection: Participation in ISIC 2017 Skin Lesion Classification Challenge

Dennis H. Murphree, Che Ngufor

This manuscript describes our participation in the International Skin Imaging Collaboration's 2017 Skin Lesion Analysis Towards Melanoma Detection competition. We participated in Part 3: Lesion Classification. The challenge comprised two binary image classification tasks: first, distinguishing (a) melanoma from (b) nevus and seborrheic keratosis; second, distinguishing (a) seborrheic keratosis from (b) nevus and melanoma. We chose a deep neural network approach with a transfer learning strategy, using a pre-trained Inception V3 network both as a feature extractor providing input to a multi-layer perceptron and as the basis for fine-tuning an augmented Inception network. This approach yielded validation set AUCs of 0.84 on the second task and 0.76 on the first task, for an average AUC of 0.80. Unfortunately, we joined the competition late, and we look forward to improving on these results.
