CLAIIR · Sep 26, 2025

Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation

arXiv:2509.22565v1 · 3 citations · h-index: 4 · Pac Symp Biocomput
Originality: Incremental advance
AI Analysis

This work addresses the need for robust AI guardrails to reduce clinician workload and ensure safety in patient messaging, though it is incremental as it builds on existing LLM evaluation techniques with a domain-specific focus.

The paper evaluates large language model (LLM)-drafted replies to patient messages for clinical inaccuracies and tone mismatches using a retrieval-augmented evaluation pipeline. Compared with a baseline evaluation without retrieved context, the context-enhanced labels agreed more closely with human reviewers (concordance = 50% vs. 33%) and scored higher (F1 = 0.500 vs. 0.256) on a 100-message validation set.

Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
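The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity model, so a bag-of-words cosine similarity stands in for semantic similarity here, and the `HistoricalPair` schema, the function names, and the context layout are all hypothetical. The sketch shows only how similar past message-response pairs might be retrieved and placed alongside a draft before an LLM judge assigns error codes; the DSPy stages themselves are not reproduced.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class HistoricalPair:
    """A past patient message and the clinician reply that was sent (hypothetical schema)."""
    message: str
    response: str


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query: str, archive: list[HistoricalPair], k: int = 2) -> list[HistoricalPair]:
    """Return the k archived pairs whose patient message is most similar to the query."""
    query_vec = tokenize(query)
    ranked = sorted(
        archive,
        key=lambda pair: cosine(query_vec, tokenize(pair.message)),
        reverse=True,
    )
    return ranked[:k]


def build_evaluation_context(patient_message: str, draft: str,
                             references: list[HistoricalPair]) -> str:
    """Assemble the text an evaluator would see: the message, the draft, and retrieved references."""
    lines = [
        f"Patient message: {patient_message}",
        f"Draft reply under review: {draft}",
    ]
    for i, pair in enumerate(references, start=1):
        lines.append(f"Reference {i} message: {pair.message}")
        lines.append(f"Reference {i} response: {pair.response}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy archive, invented for illustration only.
    archive = [
        HistoricalPair("Can I get a refill of my blood pressure medication?",
                       "Refill sent to your pharmacy."),
        HistoricalPair("My knee hurts after running.",
                       "Please rest and ice the knee."),
        HistoricalPair("I need to reschedule my appointment.",
                       "Our front desk will call you."),
    ]
    message = "I need a refill of my blood pressure medication"
    references = retrieve_similar(message, archive, k=2)
    print(build_evaluation_context(message, "We will look into it.", references))
```

In this framing, the baseline evaluation would correspond to calling `build_evaluation_context` with an empty reference list, so the judge sees the draft in isolation, while the reference-enhanced evaluation passes the retrieved pairs.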

