AI · LG · Sep 22, 2025

Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models

arXiv:2509.18221v1 · 1 citation · h-index: 2 · 2025 International Conference on Artificial Intelligence, Human-Computer Interaction and Natural Language Processing (ICAHN)
Originality: Incremental advance
AI Analysis

This work addresses the need for a unified AI framework to predict health risks for patients with chronic diseases, though it appears incremental as it builds on existing visual-linguistic models.

The paper tackles chronic disease health risk prediction from multimodal clinical data by proposing VL-RiskFormer, a hierarchical multimodal Transformer with an LLM inference head. On the MIMIC-IV cohort it reports an average AUROC of 0.90 and an expected calibration error of 2.7%.
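For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of how an expected calibration error (ECE) is commonly computed for a binary risk score. The summary does not say which ECE variant or bin count the authors used, so the equal-width binning, the default of 10 bins, and the function name `expected_calibration_error` are all assumptions made for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins (an assumed, commonly used variant).

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : observed outcomes, each 0 or 1
    Returns the sample-weighted mean gap between the average predicted risk
    and the observed event rate within each bin.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; p == 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_confidence = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(event_rate - mean_confidence)
    return ece


# Toy example with made-up numbers: slightly under-confident predictions.
toy_probs = [0.1, 0.1, 0.9, 0.9]
toy_labels = [0, 0, 1, 1]
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

Under this definition, a reported ECE of 2.7% would correspond to a return value of about 0.027, meaning predicted risks differ from observed event rates by under three percentage points on average.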

With the rising global burden of chronic diseases and the heterogeneous, multimodal nature of clinical data (medical imaging, free-text records, wearable sensor streams, etc.), there is an urgent need for a unified multimodal AI framework that can proactively predict individual health risks. We propose VL-RiskFormer, a hierarchically stacked vision-language multimodal Transformer with a large language model (LLM) inference head embedded in its top layer. The system builds on the dual-stream architecture of existing vision-language models (e.g., PaLM-E, LLaVA) with four key innovations: (i) cross-modal contrastive pre-training with fine-grained alignment of radiological images, fundus images, and wearable device photos with their corresponding clinical narratives, using momentum-updated encoders and a debiased InfoNCE loss; (ii) a temporal fusion block that integrates irregular visit sequences into the causal Transformer decoder through adaptive time-interval positional encoding; (iii) a disease ontology graph adapter that injects ICD-10 codes into the visual and textual channels layer by layer and infers comorbidity patterns with a graph attention mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an average AUROC of 0.90 with an expected calibration error of 2.7 percent.
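To make the contrastive pre-training objective in item (i) more concrete, here is a minimal sketch of the standard InfoNCE loss in the image-to-text direction. It is not the authors' implementation: the abstract does not specify the debiasing correction, the momentum-encoder update, or the temperature, so this sketch omits the first two and uses an assumed temperature of 0.07. The function names `l2_normalize` and `info_nce` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(image_emb, text_emb, temperature=0.07):
    """Mean standard InfoNCE loss, image -> text direction.

    image_emb[i] and text_emb[i] are assumed to be a matched pair; every
    other text embedding in the batch serves as a negative for image i.
    The temperature value is an assumption, not taken from the paper.
    """
    if len(image_emb) != len(text_emb) or not image_emb:
        raise ValueError("need equally sized, non-empty batches")

    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]

    total = 0.0
    for i, query in enumerate(images):
        # Cosine similarity to every text in the batch, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(query, key)) / temperature
            for key in texts
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matched pair as the target class.
        total += log_denominator - logits[i]
    return total / len(images)


# Toy embeddings (made up): matched pairs give a low loss, swapped pairs a high one.
pairs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(pairs, pairs)
swapped_loss = info_nce(pairs, [[0.0, 1.0], [1.0, 0.0]])
```

The loss falls as each image embedding moves closer to its own clinical narrative than to the other narratives in the batch, which is the alignment behaviour the pre-training stage relies on. A symmetric variant would average this with the text-to-image direction.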

