MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
This work addresses the need for scalable quality assurance of generative AI in clinical workflows, providing tools to evaluate and improve the factual accuracy of generated clinical summaries and thereby support adoption.
The authors address the problem of evaluating factual accuracy in LLM-generated clinical text by introducing MedFactEval, a framework in which an LLM Jury assesses summaries at scale. The jury achieved almost perfect agreement with a physician panel (Cohen's kappa=81%). They also present MedAgentBrief, a workflow for generating factual discharge summaries. Together, the two offer a comprehensive approach to responsible AI deployment in clinical settings.
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, because expert review does not scale to the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation in which clinicians define high-salience key facts and an "LLM Jury" (a majority vote across multiple LLMs) assesses whether each fact is included in a generated summary. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advancing the responsible deployment of generative AI in clinical workflows.
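The evaluation logic described above (per-fact majority voting, then agreement measured with Cohen's kappa) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `majority_vote` and `cohens_kappa`, the three-member jury, and all vote values are hypothetical and chosen only for illustration. The abstract does not specify the jury's models, prompts, or size, so each juror's verdict is represented here simply as a boolean meaning "this key fact is present in the summary".

```python
from collections import Counter


def majority_vote(votes):
    """Return True when strictly more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts: one row per key fact, one entry per juror.
llm_jury_votes = [
    [True, True, False],
    [False, False, False],
    [True, True, True],
    [False, True, False],
]
# Hypothetical seven-physician panel votes on the same four key facts.
physician_votes = [
    [True, True, True, False, True, True, False],
    [False, False, True, False, False, False, False],
    [True, True, True, True, True, False, True],
    [True, True, False, True, True, False, True],
]

jury_labels = [majority_vote(v) for v in llm_jury_votes]
panel_labels = [majority_vote(v) for v in physician_votes]
kappa = cohens_kappa(jury_labels, panel_labels)
print(f"Jury vs. panel kappa on toy data: {kappa:.2f}")
```

An odd number of voters, as in the seven-physician panel, guarantees that a binary majority vote cannot tie. The kappa printed here reflects only the toy data and has no bearing on the reported results.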