CV · May 18

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

arXiv:2605.18491 · 39.5

Predicted impact: top 79% in CV · last 90 days
Originality: Synthesis-oriented

AI Analysis

For medical image segmentation practitioners, this study provides a comprehensive benchmark showing that masked image modeling with self-distillation (SMIT) outperforms contrastive and rotation-based methods, especially in few-shot scenarios.

This paper benchmarks nine self-supervised learning (SSL) methods on 3D CT scans and evaluates their transferability to nine segmentation tasks. Self-distilled masked image transformer (SMIT) achieved the highest segmentation accuracy, fastest convergence, and best data efficiency, with SSL choice mattering most under limited annotations.

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment.

Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
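The two metrics named in the Methods, Dice similarity coefficient and centered kernel alignment, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's evaluation code: the function names `dice_coefficient` and `linear_cka` are hypothetical, the empty-mask convention is an assumption, and the linear variant of CKA is assumed here because the summary does not say which kernel was used.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, n_features).

    Assumption: linear kernel. The two matrices must share n_samples but may
    have different feature widths.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center each feature column over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

In a feature-reuse comparison of the kind described, `x` and `y` would plausibly be the activations of the same encoder layer after few-shot and many-shot fine-tuning, computed on the same set of inputs; a value near 1 would suggest the layer's representation changed little between the two regimes.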
