CV · Nov 22, 2024

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

arXiv:2411.14723v2 · 13 citations · h-index: 16 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in open-vocabulary segmentation for computer vision applications, representing an incremental improvement over existing two-stage methods.

The paper tackles the problem of high computational cost and memory inefficiency in open-vocabulary semantic segmentation by proposing ESC-Net, a one-stage model that integrates SAM decoder blocks with pseudo prompts, achieving superior performance and efficiency on benchmarks like ADE20K, PASCAL-VOC, and PASCAL-Context.

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. But these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
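
The abstract does not say how the pseudo prompts are derived from the image-text correlations, so the following is only an illustrative sketch of the general idea and not ESC-Net's actual procedure. It assumes dense per-pixel image features and per-class text embeddings in a shared space (as CLIP-style models provide), computes a cosine-similarity map per class, and takes the highest-scoring locations as point-style prompts. The function names `correlation_map` and `pseudo_point_prompts`, the top-k selection rule, and the array shapes are all hypothetical choices made for this example. The step that feeds the prompts to SAM decoder blocks is left out.

```python
import numpy as np


def correlation_map(image_feats, text_embeds, eps=1e-8):
    """Cosine similarity between per-pixel features and class text embeddings.

    image_feats: array of shape (H, W, D), one feature vector per location.
    text_embeds: array of shape (C, D), one embedding per class name.
    Returns an array of shape (C, H, W).
    """
    # L2-normalise both sides so the dot product is a cosine similarity.
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + eps)
    txt = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,cd->chw", img, txt)


def pseudo_point_prompts(corr, k=1):
    """Pick the k highest-correlation locations per class as (row, col) points.

    This top-k rule is an assumption for illustration only.
    corr: array of shape (C, H, W). Returns an integer array of shape (C, k, 2).
    """
    num_classes, height, width = corr.shape
    flat = corr.reshape(num_classes, height * width)
    # Sort descending by similarity and keep the first k flat indices per class.
    top = np.argsort(-flat, axis=1)[:, :k]
    rows, cols = np.divmod(top, width)
    return np.stack([rows, cols], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 16, 32))   # stand-in for dense image features
    texts = rng.normal(size=(3, 32))        # stand-in for 3 class embeddings
    points = pseudo_point_prompts(correlation_map(feats, texts), k=2)
    print(points.shape)
```

In a promptable segmenter such as SAM, points like these would take the place of user clicks, which is what would let one forward pass cover both the class-agnostic mask prediction and the class assignment.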

