CV | AI | LG | ML | Mar 21, 2024

OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

arXiv:2403.14183v2 | 20 citations | h-index: 10 | ECCV
Originality: Incremental advance
AI Analysis

This work addresses limitations in zero-shot semantic segmentation for computer vision applications, representing an incremental improvement over existing CLIP-based methods.

The paper addresses the alignment of text and pixel embeddings in zero-shot semantic segmentation. It proposes OTSeg, a multimodal attention mechanism built on multi-prompt Sinkhorn attention, and reports state-of-the-art performance with significant gains on three benchmark datasets.

The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
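
The optimal-transport step mentioned in the abstract can be illustrated with a generic entropy-regularized Sinkhorn iteration. The following is a minimal NumPy sketch and not the authors' MPS or MPSA implementation. The uniform marginals, the cosine-distance cost, the values of `eps` and `n_iters`, and the random "prompt" and "pixel" embeddings are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, eps=0.2, n_iters=500):
    """Generic Sinkhorn-Knopp: entropy-regularized transport plan.

    Assumes uniform marginals over rows (prompts) and columns (pixels).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass per text prompt (assumed uniform)
    b = np.full(m, 1.0 / m)      # mass per pixel (assumed uniform)
    K = np.exp(-cost / eps)      # Gibbs kernel, strictly positive
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings: 4 text prompts and 64 pixels, 16 dimensions each.
rng = np.random.default_rng(0)
prompts = l2_normalize(rng.normal(size=(4, 16)))
pixels = l2_normalize(rng.normal(size=(64, 16)))

cost = 1.0 - prompts @ pixels.T  # cosine distance, lower means more similar
plan = sinkhorn(cost)            # shape (4, 64), entries sum to 1
```

Each row of `plan` describes how one prompt spreads its mass over the pixels. Because every prompt must place the same total mass and every pixel receives the same total mass, the prompts cannot all concentrate on the same pixels, which is one plausible reading of how a Sinkhorn-style constraint pushes multiple prompts toward different image regions.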

Code Implementations: 1 repo
