SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation
This work improves 3D medical image segmentation for clinical applications such as disease assessment and offers a general paradigm for adapting video models to volumetric data. The contribution is incremental, however, because it builds on an existing foundation model (SAM2).
The paper adapts the Segment Anything Model 2 (SAM2) from video to 3D medical image segmentation by addressing two domain gaps: bidirectional anatomical continuity between slices and precise boundary delineation. The resulting model, SAM2-3dMed, is reported to significantly outperform state-of-the-art methods on three medical datasets in both segmentation overlap and boundary precision.
Accurate segmentation of 3D medical images is critical for clinical applications such as disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module that explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner, and 2) a Boundary Detection (BD) module that enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas tasks of the Medical Segmentation Decathlon (MSD)) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
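To make the self-supervised idea behind SRPP more concrete, the following is a minimal sketch of how relative-position targets for slice pairs could be built without any manual labels. The abstract does not give the actual formulation, so everything here is an assumption for illustration: the function names (signed_offset, relative_position_targets, mse_loss), the normalization of the offset to [-1, 1], the random pair sampling, and the use of a mean squared error loss are all hypothetical and not taken from the paper.

```python
import random


def signed_offset(i, j, num_slices):
    """Signed, normalized offset from slice i to slice j, in [-1, 1].

    Assumption: the sign encodes direction along the slice axis, so pairs
    in both directions are represented (unlike one-way temporal order).
    """
    if num_slices < 2:
        raise ValueError("need at least two slices")
    return (j - i) / (num_slices - 1)


def relative_position_targets(num_slices, num_pairs, seed=0):
    """Sample pairs of distinct slice indices with their offset targets.

    The targets come from slice indices alone, so no annotation is needed.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(num_slices), 2)  # two distinct indices
        pairs.append((i, j, signed_offset(i, j, num_slices)))
    return pairs


def mse_loss(predictions, targets):
    """Mean squared error between predicted and true offsets (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

In a training loop, a hypothetical prediction head would take features of slices i and j and regress the offset, with mse_loss added to the segmentation objective as an auxiliary term. Swapping i and j flips the sign of the target, which is one simple way to expose a model to both directions along the slice axis.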