EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease
EMAD addresses the need for interpretable, clinically aligned AI in medical imaging, particularly for Alzheimer's disease, although its contribution is incremental: it combines existing grounding and distillation techniques.
The paper tackles the lack of transparency in deep learning models for Alzheimer's disease diagnosis by introducing EMAD, a vision-language framework that generates structured diagnostic reports whose claims are explicitly grounded in multimodal evidence. The authors report state-of-the-art diagnostic accuracy on the AD-MultiSense dataset.
Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
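The abstract does not spell out how Executable-Rule GRPO computes its rewards, so the following is only a minimal sketch of the general idea: score each sampled report with executable checks, then normalize the scores within the sampled group, as GRPO-style methods typically do. The report format (a dict with `diagnosis`, `reasoning`, `sections`, `mmse`), the three rule functions, the MMSE threshold, the label `AD`, and the equal weighting of rules are all assumptions made for illustration and are not taken from the paper.

```python
import re
from statistics import mean, pstdev

# Assumed protocol for illustration only; the paper's actual protocol is not given.
REQUIRED_SECTIONS = ("findings", "impression")

def rule_protocol(report):
    """Protocol adherence: every assumed required section is present and non-empty."""
    sections = report.get("sections", {})
    return all(sections.get(name) for name in REQUIRED_SECTIONS)

def rule_coherence(report):
    """Reasoning-diagnosis coherence: the diagnosis label appears as a word in the reasoning."""
    words = re.findall(r"[a-z]+", report["reasoning"].lower())
    return report["diagnosis"].lower() in words

def rule_clinical(report):
    """Clinical consistency (toy rule, hypothetical threshold): flag an AD
    diagnosis paired with a near-ceiling MMSE score."""
    return not (report["diagnosis"] == "AD" and report.get("mmse", 0) >= 28)

RULES = (rule_protocol, rule_coherence, rule_clinical)

def rule_reward(report):
    """Verifiable reward: fraction of rules the report satisfies (equal weights assumed)."""
    return sum(1.0 for rule in RULES if rule(report)) / len(RULES)

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative normalization: subtract the group mean and divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A hypothetical group of three reports sampled for the same patient.
group = [
    {"diagnosis": "AD", "reasoning": "Hippocampal atrophy supports AD.",
     "sections": {"findings": "atrophy", "impression": "AD"}, "mmse": 20},
    {"diagnosis": "AD", "reasoning": "No abnormality seen.",
     "sections": {"findings": "normal"}, "mmse": 29},
    {"diagnosis": "AD", "reasoning": "Findings are consistent with AD.",
     "sections": {"findings": "atrophy"}, "mmse": 18},
]

rewards = [rule_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first report satisfies all three rules and the second satisfies none, so the first receives the largest advantage and the second the smallest. Because every check is a plain function over the report, the reward can be recomputed and audited without a learned reward model, which is the usual appeal of verifiable rewards.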