AI CL IR LO | Jan 28, 2025

VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Philip Chung, Akshay Swaminathan, Alex J. Goodell, Yeasul Kim, S. Momsen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, David Seong, Andrew A. Lee

arXiv:2501.16672v1 | 22.919 citations | h-index: 44 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck in developing LLM-based applications for clinical medicine by providing a method to verify generated text against patient records. The advance is incremental, since it builds on existing techniques such as retrieval-augmented generation.

The paper addresses the factual accuracy of LLM-generated clinical text by introducing VeriFact, a system that verifies text against electronic health records. VeriFact achieves up to 92.7% agreement with a clinician-derived ground truth, compared with a highest inter-clinician agreement of 88.5%.

Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
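The retrieve-then-judge flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The token-overlap retriever, the rule-based `toy_judge` (a stand-in for a real LLM judge), the binary `supported` / `not_supported` labels, and all function names are assumptions made for illustration only.

```python
from __future__ import annotations

from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return {tok for tok in tokens if tok}


def retrieve(statement: str, ehr_notes: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever (assumption): rank notes by token overlap with the statement."""
    query = tokenize(statement)
    ranked = sorted(
        ehr_notes,
        key=lambda note: len(query & tokenize(note)),
        reverse=True,
    )
    return ranked[:top_k]


def toy_judge(statement: str, evidence: list[str]) -> str:
    """Stand-in for an LLM judge (assumption): a statement counts as
    supported only if every one of its tokens appears in the evidence."""
    evidence_tokens = set().union(*map(tokenize, evidence))
    if tokenize(statement) <= evidence_tokens:
        return "supported"
    return "not_supported"


def verify(
    statements: list[str],
    ehr_notes: list[str],
    judge: Callable[[str, list[str]], str] = toy_judge,
) -> list[str]:
    """For each statement, retrieve evidence from the notes and ask the judge."""
    return [judge(s, retrieve(s, ehr_notes)) for s in statements]


def percent_agreement(predicted: list[str], reference: list[str]) -> float:
    """Share of statements (in percent) where two label lists match."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


if __name__ == "__main__":
    notes = [
        "Patient was admitted with pneumonia.",
        "Patient received ceftriaxone on day 2.",
        "No known drug allergies.",
    ]
    statements = [
        "Patient received ceftriaxone.",
        "Patient was admitted with sepsis.",
    ]
    labels = verify(statements, notes)
    print(labels)  # ['supported', 'not_supported']
    print(percent_agreement(labels, ["supported", "supported"]))  # 50.0
```

In a realistic setting the `judge` argument would wrap a call to an LLM, and the retriever would search an index built over the patient's EHR notes. `percent_agreement` shows only the simple matching arithmetic behind agreement figures of this kind, not the denoising and adjudication procedure the abstract mentions.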
