CV LG | Mar 25, 2025

BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

arXiv:2503.19769v3 | 1 citation | h-index: 1

Originality: Incremental advance

AI Analysis

This work aims to reduce annotation burden and improve segmentation accuracy for computer vision practitioners in medical and general domains. It offers a simple, interpretable multi-modal fusion approach, but the advance is incremental because it builds on existing models such as SAM and BEIT-3.

The paper tackles the challenge of combining point and text prompts for image segmentation by introducing BiPrompt-SAM, a dual-modal framework with an explicit selection mechanism. It reports strong zero-shot performance on medical data (89.55% mDice on Endovis17) and outperforms existing methods on the RefCOCO series (up to 87.1% IoU).

Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The Segment Anything Model (SAM) excels at point-prompted segmentation, while text-based models, often leveraging powerful multimodal encoders like BEIT-3, provide rich semantic understanding. However, effectively combining these complementary modalities remains a challenge. This paper introduces BiPrompt-SAM, a novel dual-modal prompt segmentation framework employing an explicit selection mechanism. We leverage SAM's ability to generate multiple mask candidates from a single point prompt and use a text-guided mask (generated via EVF-SAM with BEIT-3) to select the point-generated mask that best aligns spatially, measured by Intersection over Union (IoU). This approach, interpretable as a simplified Mixture of Experts (MoE), effectively fuses spatial precision and semantic context without complex model modifications. Notably, our method achieves strong zero-shot performance on the Endovis17 medical dataset (89.55% mDice, 81.46% mIoU) using only a single point prompt per instance. This significantly reduces annotation burden compared to bounding boxes and aligns better with practical clinical workflows, demonstrating the method's effectiveness without domain-specific training. On the RefCOCO series, BiPrompt-SAM attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments show BiPrompt-SAM excels in scenarios requiring both spatial accuracy and semantic disambiguation, offering a simple, effective, and interpretable perspective on multi-modal prompt fusion.
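The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the point-prompted candidate masks and the text-guided mask are already available as binary NumPy arrays of the same shape; the function names `mask_iou` and `select_mask` are hypothetical and are not taken from the paper or from the SAM or EVF-SAM codebases. Tie-breaking and the handling of an empty text mask are likewise assumptions, since the abstract does not specify them.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two binary masks of equal shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty masks are treated as having zero overlap.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def select_mask(point_masks, text_mask):
    """Return (index, IoU) of the candidate that best matches text_mask.

    point_masks: sequence of binary masks (e.g. the multiple candidates
                 a point-prompted model produces for one click).
    text_mask:   binary mask produced from the text prompt.
    Ties resolve to the first candidate (np.argmax behaviour).
    """
    ious = [mask_iou(m, text_mask) for m in point_masks]
    best = int(np.argmax(ious))
    return best, ious[best]


# Toy 4x4 example with hand-made masks (not real model output).
text_mask = np.zeros((4, 4), dtype=bool)
text_mask[1:3, 1:3] = True            # 2x2 block, 4 pixels

whole = np.ones((4, 4), dtype=bool)   # over-segmented candidate
part = np.zeros((4, 4), dtype=bool)
part[1:3, 1:4] = True                 # 2x3 block that covers the text mask
tiny = np.zeros((4, 4), dtype=bool)
tiny[1, 1] = True                     # under-segmented candidate

best_index, best_iou = select_mask([whole, part, tiny], text_mask)
print(best_index, round(best_iou, 3))
```

In this toy case the second candidate is chosen, since it overlaps the text-guided mask most closely. The returned mask is always one of the point-generated candidates; the text mask only acts as a selector, which is what makes the fusion explicit and easy to inspect.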
