CV · May 18

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

arXiv:2605.18491 · 39.5
Predicted impact: top 79% in CV · last 90 days · Originality: Synthesis-oriented
AI Analysis

For medical image segmentation practitioners, this study provides a comprehensive benchmark showing that masked image modeling with self-distillation (SMIT) outperforms contrastive and rotation-based methods, especially in few-shot scenarios.

This paper benchmarks nine self-supervised learning (SSL) methods on 3D CT scans and evaluates their transferability to nine segmentation tasks. Self-distilled masked image transformer (SMIT) achieved the highest segmentation accuracy, fastest convergence, and best data efficiency, with SSL choice mattering most under limited annotations.

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using the Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment.

Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
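
The two evaluation quantities named in Methods, the Dice similarity coefficient (DSC) and centered kernel alignment (CKA), can be illustrated with a minimal NumPy sketch. This is not the paper's evaluation code. The function names `dice_coefficient` and `linear_cka` are hypothetical, the choice of linear CKA (rather than a kernel variant) is an assumption because the summary does not say which form was used, and treating two empty masks as a perfect match is a convention chosen here for illustration.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * intersection / denom)


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are the same n inputs passed through two networks or layers.
    Assumption: the linear form is used here for illustration only.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center each feature dimension across the n examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

For example, a predicted mask `[[1, 1], [0, 0]]` against a reference `[[1, 0], [0, 0]]` has one overlapping voxel out of three foreground voxels in total, so its DSC is 2/3. For CKA, comparing a layer's features from a few-shot fine-tuned model with the same layer's features from a many-shot model on identical inputs gives a value in [0, 1]; values near 1 would suggest the two models reuse similar representations at that layer.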

