CV · Apr 10

M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

Yihang Liu, Ying Wen, Jiaxiong Yang, Longzhen Yang, Lianghua He, Heng Tao Shen

arXiv:2604.08936 · 63.7

Predicted impact: top 54% in CV · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses the degradation of modality specificity and diversity in medical foundation models, a problem that matters for generalization in clinical applications. The contribution appears incremental, since it builds on existing self-supervised and MoE approaches.

The paper tackles the problem of information ambiguity in medical foundation models by proposing M-IDoL, which uses information decomposition to learn modality-specific and diverse representations, resulting in superior generalization across 21 downstream clinical tasks and outperforming 20 existing models on five imaging modalities.

Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blends multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representations into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (X-ray, fundus, OCT, dermoscopy, and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.
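The abstract does not give the loss functions, so the following is only a toy illustration of how the two stated objectives could be combined, not the authors' formulation. It assumes (hypothetically) that "inter-modality entropy" can be approximated by the entropy of average expert usage under a softmax gate, and that "fine-grained semantic discrimination" can be approximated by an InfoNCE contrastive loss computed separately inside each expert's group of samples. All function names (`routing_entropies`, `info_nce`, `toy_objective`) are made up for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -(p * np.log(p + eps)).sum(axis=axis)


def routing_entropies(gate_logits):
    """gate_logits: (batch, num_experts) scores from a hypothetical MoE gate.

    Returns (usage_entropy, per_sample_entropy):
    - usage_entropy: entropy of the batch-averaged routing distribution
      (high when samples spread across experts).
    - per_sample_entropy: mean entropy of each sample's routing
      (low when each sample is confidently sent to one expert).
    """
    probs = softmax(gate_logits, axis=1)
    usage_entropy = entropy(probs.mean(axis=0))
    per_sample_entropy = entropy(probs, axis=1).mean()
    return usage_entropy, per_sample_entropy


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss between two views; row i of z1 pairs with row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    # Row-wise log-sum-exp, computed stably.
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_probs = logits - log_norm
    return -np.mean(np.diag(log_probs))


def toy_objective(gate_logits, z1, z2, temperature=0.1):
    """Assumed combination (not from the paper): spread usage across experts,
    keep routing confident, and contrast samples within each expert group."""
    usage_entropy, per_sample_entropy = routing_entropies(gate_logits)
    expert_ids = gate_logits.argmax(axis=1)  # hard assignment for grouping
    contrastive_terms = []
    for k in np.unique(expert_ids):
        mask = expert_ids == k
        if mask.sum() >= 2:  # a contrastive loss needs at least one negative
            contrastive_terms.append(info_nce(z1[mask], z2[mask], temperature))
    contrastive = float(np.mean(contrastive_terms)) if contrastive_terms else 0.0
    return -usage_entropy + per_sample_entropy + contrastive
```

In this toy setup, a gate that routes each modality confidently to its own expert scores lower (better) than a gate that spreads every sample uniformly over all experts, which mirrors the intuition of separable subspaces described in the abstract. The equal weighting of the three terms is arbitrary.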
