CV · Sep 24, 2024

DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

arXiv:2409.15801v1 · 20 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of limited supervision in semantic segmentation for computer vision applications. It represents an incremental improvement over existing methods.

The paper tackles weakly supervised semantic segmentation by introducing DALNet, which uses text embeddings and a dual-level alignment strategy to improve object localization and context capture. It reports state-of-the-art results on the PASCAL VOC and MS COCO datasets.

Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image-level labels. To address this issue, we introduce DALNet, a Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA), which captures global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA), which improves object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. In particular, as a single-stage method, our approach allows for a more efficient end-to-end process.
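
The dual-level alignment idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names (`global_alignment_loss`, `patch_similarity_map`), the InfoNCE-style form of the loss, the use of cosine similarity, and the `temperature` value are all assumptions chosen for illustration. The abstract only states that the class token is pulled toward the matching text embeddings and pushed away from background embeddings, and that patch tokens supply spatial information.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def global_alignment_loss(class_token, fg_text, bg_text, temperature=0.1):
    """Hypothetical global alignment loss (assumed InfoNCE-style form).

    The loss is small when class_token is similar to the foreground
    text embeddings and dissimilar to the background embeddings.
    """
    pos = [math.exp(cosine(class_token, t) / temperature) for t in fg_text]
    neg = [math.exp(cosine(class_token, t) / temperature) for t in bg_text]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


def patch_similarity_map(patch_tokens, text_embedding):
    """Hypothetical local alignment signal: one similarity score per patch.

    High scores mark patches that likely belong to the class described
    by text_embedding, which gives a coarse localization map.
    """
    return [cosine(p, text_embedding) for p in patch_tokens]


# Toy 2-D embeddings: the "object" direction is [1, 0], background is [0, 1].
fg_text = [[1.0, 0.0]]
bg_text = [[0.0, 1.0]]

aligned = global_alignment_loss([1.0, 0.0], fg_text, bg_text)
misaligned = global_alignment_loss([0.0, 1.0], fg_text, bg_text)

# Two patches: the first looks like the object, the second like background.
scores = patch_similarity_map([[1.0, 0.0], [0.0, 1.0]], fg_text[0])
```

In this toy setup, a class token that points in the same direction as the foreground text embedding yields a lower loss than one that points toward the background embedding, and the per-patch scores are higher for the object-like patch. A real model would compute these quantities on learned, high-dimensional embeddings in a tensor library with automatic differentiation.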
