CL AI IR · Sep 26, 2025

Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation

Wenyuan Chen, Fateme Nateghi Haredasht, Kameron C. Black, Francois Grolleau, Emily Alsentzer, Jonathan H. Chen, Stephen P. Ma

arXiv:2509.22565v16.73 citations · h-index: 4 · Pac Symp Biocomput

Originality: Incremental advance

AI Analysis

This work addresses the need for robust AI guardrails to reduce clinician workload and ensure safety in patient messaging, though it is incremental as it builds on existing LLM evaluation techniques with a domain-specific focus.

The paper evaluates large language model (LLM)-drafted patient-clinician messages for clinical inaccuracies and tone mismatches using a retrieval-augmented evaluation pipeline. Compared with a baseline that assesses drafts in isolation, the pipeline improved error identification and, in human validation on 100 messages, showed higher agreement with human reviewers (concordance = 50% vs. 33%) and better performance (F1 = 0.500 vs. 0.256).

Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
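The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the embedding model, similarity measure, or prompt format, so the code below substitutes a bag-of-words cosine similarity for semantic retrieval, and the names `retrieve_similar` and `build_eval_context`, along with the sample archive, are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query, archive, k=2):
    """Return the k archived (message, response) pairs closest to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        archive,
        key=lambda pair: cosine(q, bag_of_words(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_eval_context(patient_message, draft, references):
    """Assemble the text an evaluator would see (hypothetical format)."""
    lines = ["Patient message: " + patient_message, "Draft reply: " + draft]
    for i, (msg, resp) in enumerate(references, start=1):
        lines.append(f"Reference {i} message: {msg}")
        lines.append(f"Reference {i} response: {resp}")
    return "\n".join(lines)


# Invented sample archive, for illustration only.
archive = [
    ("Can I get a refill of my blood pressure medication?",
     "We sent the refill to your pharmacy."),
    ("My knee still hurts after surgery, is that normal?",
     "Some pain is expected; please call if swelling increases."),
    ("How do I reschedule my appointment?",
     "You can reschedule through the portal or call the front desk."),
]

query = "I need a refill for my blood pressure pills"
references = retrieve_similar(query, archive, k=2)
context = build_eval_context(query, "Your refill has been approved.", references)
print(context)
```

In a pipeline of the kind the abstract describes, the assembled context would be passed to the error-detection stage in place of the draft alone, which is the difference between the baseline and reference-enhanced conditions.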
