Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
This work addresses the data-efficiency challenge for segmentation models, which benefits computer vision researchers and practitioners. Its contribution is incremental, since it builds on existing SAM variants.
The paper aims to reduce the massive training data requirements of Segment Anything Models (SAM). It proposes a lightweight RGB-D fusion framework that uses monocular depth priors and reports higher accuracy than EfficientViT-SAM with only 11.2k training samples (less than 0.1% of SA-1B).
Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
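To make the mid-level fusion idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the depth encoder architecture or the fusion operator, so everything here is an assumption made for illustration: the function names `depth_encoder` and `fuse_mid_level`, the average-pooling plus 1x1 projection used as a stand-in encoder, the untrained random weights, and the choice of element-wise addition for fusion.

```python
import numpy as np


def depth_encoder(depth, out_channels, stride, rng):
    """Hypothetical stand-in for a depth encoder (assumption, not the paper's design).

    Average-pools a single-channel depth map down to the resolution of the
    mid-level RGB feature map, then lifts it to `out_channels` channels with
    an untrained 1x1 projection.
    """
    h, w = depth.shape
    # Non-overlapping average pooling with a window of size `stride`.
    pooled = depth.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))
    # 1x1 projection from 1 channel to out_channels (random weights, illustration only).
    weight = rng.standard_normal((out_channels, 1)) * 0.1
    return weight[:, :, None] * pooled[None, :, :]  # shape: (C, h/stride, w/stride)


def fuse_mid_level(rgb_feat, depth_feat):
    """Fuse depth features into RGB features by element-wise addition (assumed operator)."""
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("RGB and depth features must have the same shape")
    return rgb_feat + depth_feat


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((64, 16, 16))  # placeholder mid-level RGB features (C, H, W)
depth = rng.random((64, 64))                  # placeholder monocular depth map (H, W)

depth_feat = depth_encoder(depth, out_channels=64, stride=4, rng=rng)
fused = fuse_mid_level(rgb_feat, depth_feat)
print(fused.shape)  # (64, 16, 16)
```

In a real pipeline the depth map would come from a pretrained monocular estimator and the fused features would feed the remaining stages of the segmentation model. Addition is used here only because it keeps the feature shape unchanged; concatenation followed by a projection would be an equally plausible choice.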