Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture
This work addresses the challenge of building robust multimodal diagnostic models for pulmonary nodules in medical imaging. Its contribution is incremental, since it adapts existing self-supervised techniques to a specific domain.
The paper tackles limited labeled data and overfitting in multimodal models for pulmonary nodule diagnosis by pretraining a joint embedding predictive architecture (JEPA) with self-supervised learning on unlabeled CT scans and electronic health records. After supervised finetuning, the approach outperformed both baselines in an internal cohort (0.91 AUC vs. 0.88 for the unregularized multimodal model and 0.73 for the imaging-only model) but underperformed the imaging-only model in an external cohort (0.72 vs. 0.75 AUC).
The development of multimodal models for pulmonary nodule diagnosis is limited by the scarcity of labeled data and the tendency of these models to overfit to the training distribution. In this work, we leverage self-supervised learning from longitudinal and multimodal archives to address these challenges. We curate an unlabeled set of patients with CT scans and linked electronic health records from our home institution to power joint embedding predictive architecture (JEPA) pretraining. After supervised finetuning, we show that our approach outperforms an unregularized multimodal model and an imaging-only model in an internal cohort (ours: 0.91, multimodal: 0.88, imaging-only: 0.73 AUC), but underperforms the imaging-only model in an external cohort (ours: 0.72, imaging-only: 0.75 AUC). We develop a synthetic environment that characterizes the context in which JEPA may underperform. This work introduces an approach that leverages unlabeled multimodal medical archives to improve predictive models and demonstrates its advantages and limitations in pulmonary nodule diagnosis.
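All results above are reported as AUC (area under the ROC curve). As a reading aid, the following minimal sketch computes AUC from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not the paper's evaluation code or data.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values (1 = positive case).
    scores: iterable of model scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(labels, scores)
print(f"toy AUC = {toy_auc:.3f}")
```

Under this definition, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is one way to read the gap between the internal and external cohort results. The quadratic loop is chosen for clarity; a rank-based implementation would be preferable for large cohorts.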