CV · Apr 9

Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

Sumra Khan, Sagar Chhabriya, Aizan Zafar, Sheeraz Arif, Amgad Muneer, Anas Zafar, Shaina Raza, Rizwan Qureshi

arXiv:2604.08815 · 71.9 · h-index: 6

Predicted impact: top 40% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For medical AI practitioners, this work addresses the problem of weakly grounded conclusions in multimodal medical reasoning by enforcing multi-evidence agreement, improving reliability and trustworthiness.

The paper introduces a context-aligned reasoning framework for medical VLMs that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. On chest X-ray datasets, it improves AUC from 0.918 to 0.925, reduces hallucinated keywords from 1.14 to 0.25, and produces more concise explanations (19.4 to 15.3 words) without increasing model confidence.

Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
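
The abstract describes two ideas that a small sketch may make more concrete: checking that auxiliary signals agree with the frozen VLM before a conclusion is issued, and returning a structured record rather than free text. The Python below is only an illustration under stated assumptions and is not the authors' implementation. The names `StructuredFinding` and `verify_context`, the 0.5 threshold, the "at least two agreeing signals" rule, and the way uncertainty is computed are all hypothetical. The paper's abstract specifies only the signal types (radiomic statistics, explainability activations, vocabulary-grounded cues) and the output fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredFinding:
    """Hypothetical structured output; fields mirror those listed in the abstract."""
    conclusion: str
    supporting_evidence: List[str]
    uncertainty: float  # 0.0 = all signals agree, 1.0 = none agree
    limitations: List[str] = field(default_factory=list)
    safety_notes: List[str] = field(default_factory=list)


def verify_context(label: str, vlm_prob: float, signals: Dict[str, float],
                   threshold: float = 0.5, min_agree: int = 2) -> StructuredFinding:
    """Assumed agreement rule: a signal agrees with the VLM when both fall
    on the same side of `threshold`. This rule is illustrative only."""
    vlm_positive = vlm_prob >= threshold

    # Split the auxiliary signals by whether they support the VLM's call.
    agreeing = sorted(name for name, score in signals.items()
                      if (score >= threshold) == vlm_positive)
    disagreeing = sorted(set(signals) - set(agreeing))

    # Assumed uncertainty proxy: fraction of signals that disagree.
    uncertainty = len(disagreeing) / len(signals) if signals else 1.0

    # Only commit to a conclusion when enough independent evidence agrees.
    if len(agreeing) >= min_agree:
        state = "present" if vlm_positive else "absent"
        conclusion = f"{label}: {state}"
    else:
        conclusion = f"{label}: inconclusive"

    return StructuredFinding(
        conclusion=conclusion,
        supporting_evidence=agreeing,
        uncertainty=uncertainty,
        limitations=[f"{name} does not support the VLM output"
                     for name in disagreeing],
        safety_notes=["Not a substitute for radiologist review."],
    )


# Example call with made-up scores (not data from the paper).
finding = verify_context(
    "cardiomegaly", 0.81,
    {"radiomics": 0.7, "saliency": 0.6, "vocabulary": 0.3},
)
```

In this sketch the VLM's probability is never raised by the auxiliary signals. They can only confirm or withhold a conclusion, which is one simple way to be consistent with the reported result that confidence did not increase.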
