BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts
This work aims to reduce annotation burden and improve segmentation accuracy for computer vision practitioners in both medical and general domains. It offers a simple, interpretable approach to multi-modal prompt fusion, though the contribution is incremental because it builds on existing models such as SAM and BEIT-3.
The paper addresses the challenge of combining point and text prompts for image segmentation by introducing BiPrompt-SAM, a dual-modal framework with an explicit selection mechanism. It achieves strong zero-shot performance on medical data (89.55% mDice on Endovis17) and outperforms existing methods on RefCOCO (up to 87.1% IoU).
Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The Segment Anything Model (SAM) excels at point-prompted segmentation, while text-based models, often leveraging powerful multimodal encoders like BEIT-3, provide rich semantic understanding. However, effectively combining these complementary modalities remains a challenge. This paper introduces BiPrompt-SAM, a novel dual-modal prompt segmentation framework employing an explicit selection mechanism. We leverage SAM's ability to generate multiple mask candidates from a single point prompt and use a text-guided mask (generated via EVF-SAM with BEIT-3) to select the point-generated candidate with the highest Intersection over Union (IoU) against it, that is, the candidate that best aligns with it spatially. This approach, interpretable as a simplified Mixture of Experts (MoE), effectively fuses spatial precision and semantic context without complex model modifications. Notably, our method achieves strong zero-shot performance on the Endovis17 medical dataset (89.55% mDice, 81.46% mIoU) using only a single point prompt per instance. A single point significantly reduces annotation burden compared to bounding boxes and aligns better with practical clinical workflows, and the result demonstrates the method's effectiveness without domain-specific training. On the RefCOCO series, BiPrompt-SAM attains 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments show BiPrompt-SAM excels in scenarios requiring both spatial accuracy and semantic disambiguation, offering a simple, effective, and interpretable perspective on multi-modal prompt fusion.
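The explicit selection step described above can be illustrated with a minimal sketch. It assumes the candidate masks and the text-guided mask are already available as binary arrays of the same shape. The function names `mask_iou` and `select_point_mask` are hypothetical, the masks are toy arrays rather than real SAM or EVF-SAM outputs, and returning an IoU of 0 for an empty union is a choice made here for illustration, not something the paper specifies.

```python
import numpy as np


def mask_iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty masks are treated as having no overlap.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def select_point_mask(point_masks, text_mask):
    """Return the index of the point-generated candidate that has the
    highest IoU with the text-guided mask, plus all IoU scores."""
    ious = [mask_iou(m, text_mask) for m in point_masks]
    return int(np.argmax(ious)), ious


# Toy stand-in for the text-guided mask: a 4x4 square (16 pixels).
text_mask = np.zeros((6, 6), dtype=bool)
text_mask[1:5, 1:5] = True

# Toy stand-ins for the candidates produced from one point prompt.
cand_small = np.zeros((6, 6), dtype=bool)
cand_small[2:4, 2:4] = True            # 4 pixels, all inside the square
cand_mid = np.zeros((6, 6), dtype=bool)
cand_mid[1:5, 1:4] = True              # 12 pixels, all inside the square
cand_large = np.ones((6, 6), dtype=bool)  # whole image, 36 pixels

best_index, ious = select_point_mask(
    [cand_small, cand_mid, cand_large], text_mask
)
print(best_index, ious)
```

In this toy case the middle candidate is selected: it overlaps the text-guided mask most closely, while the small candidate covers too little of it and the large one spills far outside it. The text-guided mask only decides which candidate is chosen; the output is still one of the point-generated masks, so it keeps that candidate's boundaries.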