CL · LG · Mar 11

VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

arXiv:2603.10494v1 · 3.6 · h-index: 57

Predicted impact: top 85% in CL · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses accuracy and completeness in clinical summarization for healthcare professionals. The advance is incremental: it builds on existing alignment and verification methods.

The paper tackles the problem of generating clinically useful and faithful Brief Hospital Course narratives from EHR evidence, where LLM-based summarizers often introduce unsupported statements or omit content. On held-out patients, VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% under the local verifier judge and from 11.6% to 6.4% under the GPT-4o judge, while improving validity from 76.7% to 82.5% and maintaining informative length.

Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions ("say-less" degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.
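The preference-mining step described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Candidate` class, the `utility` formula, and the parameters `coverage_weight`, `target_claims`, `max_len_ratio`, and `min_gap` are all hypothetical. The paper aggregates verifier margins, whereas this sketch uses hard per-claim labels for simplicity, and it reads "contradiction-anchored" as "the rejected candidate contains at least one Not Supported claim".

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    """A sampled BHC summary with one verifier label per sentence-level claim."""
    text: str
    claim_labels: list[str]  # "Supported", "Not Supported", or "Not Addressed"


def not_supported_rate(c: Candidate) -> float:
    """Fraction of claims the verifier labeled Not Supported."""
    if not c.claim_labels:
        return 0.0
    return c.claim_labels.count("Not Supported") / len(c.claim_labels)


def utility(c: Candidate, coverage_weight: float = 0.5, target_claims: int = 10) -> float:
    """Hypothetical coverage-aware utility (not the paper's formula).

    Faithfulness rewards supported claims and penalizes unsupported ones;
    the coverage term rewards saying more supported things, which
    discourages "say-less" degeneration.
    """
    n = len(c.claim_labels)
    if n == 0:
        return 0.0
    supported = c.claim_labels.count("Supported")
    unsupported = c.claim_labels.count("Not Supported")
    faithfulness = (supported - unsupported) / n
    coverage = min(supported / target_claims, 1.0)
    return faithfulness + coverage_weight * coverage


def mine_pairs(cands: list[Candidate], max_len_ratio: float = 1.2,
               min_gap: float = 0.1) -> list[tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) pairs suitable for DPO training."""
    pairs = []
    for a, b in combinations(cands, 2):
        chosen, rejected = (a, b) if utility(a) >= utility(b) else (b, a)
        len_c, len_r = len(chosen.text.split()), len(rejected.text.split())
        # Length control: skip pairs whose word counts differ too much.
        if max(len_c, len_r) > max_len_ratio * min(len_c, len_r):
            continue
        # Assumed reading of "contradiction-anchored": the rejected
        # candidate must contain at least one Not Supported claim.
        if "Not Supported" not in rejected.claim_labels:
            continue
        # Require a clear utility gap so the preference is meaningful.
        if utility(chosen) - utility(rejected) < min_gap:
            continue
        pairs.append((chosen, rejected))
    return pairs
```

Pairs mined this way would then serve as the chosen and rejected completions for a DPO trainer. The length filter is one plausible way to keep the optimizer from learning that shorter summaries are safer.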
