Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
This work addresses the need for trustworthy, interpretable summarization in clinical settings, though its contribution is incremental in that it builds on existing attention mechanisms.
The paper tackles transparent source attribution in clinical summarization by proposing a training-free framework that uses decoder attentions to cite supporting text spans or images, reporting gains such as +15% F1 over embedding baselines.
Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution performs competitively with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
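To make the idea of attention-guided attribution concrete, the sketch below shows one way a generated sentence could be mapped to the source span that receives the most decoder attention. It is not the authors' implementation. The function name `attribute_sentence`, the assumption that attention has already been averaged over layers and heads, and the sum-then-normalize span scoring are all illustrative choices that the abstract does not specify. In a caption-as-span setting, one of the spans would simply be the token range of a generated image caption.

```python
import numpy as np

def attribute_sentence(attn, source_spans, top_k=1):
    """Rank source spans by the attention mass a generated sentence places on them.

    attn: array of shape (num_generated_tokens, num_source_tokens), assumed to be
          decoder-to-source attention already averaged over layers and heads
          (an assumption; the aggregation is a free design choice).
    source_spans: list of (start, end) source-token index ranges, end exclusive.
    Returns a list of (span_index, normalized_score) for the top_k spans.
    """
    attn = np.asarray(attn, dtype=float)
    # Total attention each source token receives from the sentence's tokens.
    token_mass = attn.sum(axis=0)
    # Aggregate token-level mass into span-level scores.
    span_scores = np.array([token_mass[s:e].sum() for s, e in source_spans])
    total = span_scores.sum()
    if total > 0:
        # Normalize so the scores over the listed spans sum to 1.
        span_scores = span_scores / total
    # Stable sort keeps the earlier span first when scores tie.
    order = np.argsort(-span_scores, kind="stable")[:top_k]
    return [(int(i), float(span_scores[i])) for i in order]

# Hypothetical toy input: 2 generated tokens attending over 6 source tokens.
attn = [
    [0.05, 0.05, 0.5, 0.2, 0.1, 0.1],
    [0.10, 0.10, 0.3, 0.3, 0.1, 0.1],
]
# Spans 0 and 1 stand for dialogue turns; span 2 stands for a caption
# replacing an image (caption-as-span mode).
spans = [(0, 2), (2, 5), (5, 6)]
cites = attribute_sentence(attn, spans, top_k=1)
print(cites)
```

Summing attention over the sentence's tokens favors longer spans; dividing each span score by its length is an equally plausible alternative, and the choice would need to be validated against attribution labels.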