Shayan Jalilian

h-index1
2papers

2 Papers

11.3CVMay 24
CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Shayan Jalilian, Abdul Bais

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

CVJul 31, 2025
SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

Shayan Jalilian, Abdul Bais

The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.