MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis
This work aims to improve cancer diagnosis and prognosis by integrating heterogeneous data in computational pathology. It reports strong task-specific gains, achieved through incremental methodological advances.
The paper tackles the integration of multimodal data (whole slide images and clinical descriptors) in computational pathology by introducing MMSF, a multitask and multimodal supervised framework. On classification (CAMELYON16 and TCGA-NSCLC), MMSF improves accuracy by 2.1–6.6% and AUC by 2.2–6.9% over baselines. On survival analysis across five TCGA cohorts, it improves C-index by 7.1–9.8% over unimodal methods and by 5.6–7.1% over multimodal alternatives.
Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1--6.6\% accuracy and 2.2--6.9\% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1--9.8\% C-index improvements compared with unimodal methods and 5.6--7.1\% over multimodal alternatives.
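For readers unfamiliar with the survival metric reported above, the following is a minimal sketch of how a C-index can be computed. The abstract does not say which variant is used; this sketch assumes Harrell's concordance index, in which a pair of patients is comparable when the patient with the shorter time had an observed (uncensored) event. The function name `concordance_index` and the toy data are illustrative and do not come from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (assumed variant).

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # Only patients with an observed event can anchor a comparable pair.
        if events[i] != 1:
            continue
        for j in range(n):
            # Pair (i, j) is comparable when i's event precedes j's time.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk failed earlier: correct order
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third patient is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
print(toy_c_index)
```

In the toy cohort, four of the five comparable pairs are ordered correctly. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.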