CV | Jul 15, 2024

Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Jimin Xiao, Fei Ma, Renrong Ouyang

arXiv:2407.10649v2 | 14.743 citations | h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the efficiency and accuracy limitations in weakly supervised semantic segmentation for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles the problem of weakly supervised semantic segmentation using only image-level labels by introducing Adaptive Patch Contrast (APC), a Vision Transformer-based method that improves patch embedding learning and training efficiency, achieving state-of-the-art results on PASCAL VOC 2012 and MS COCO 2014 datasets with shorter training times.

Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework uses image-level labels as training data to generate pixel-level pseudo-labels, which are then refined. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels, particularly in recognizing complete object regions, compared to CNN methods. However, current ViT-based approaches have two limitations: their use of patch embeddings is prone to being dominated by certain abnormal patches, and many multi-stage methods are time-consuming to train and therefore inefficient. In this paper, we introduce a novel ViT-based WSSS method named Adaptive Patch Contrast (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC utilizes an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, thereby further improving the final results. Furthermore, we improve upon the existing multi-stage training framework without CAM by transforming it into an end-to-end single-stage training approach, thereby enhancing training efficiency. The experimental results show that our approach is effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets within a shorter training duration.
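
The concern about max pooling being dominated by abnormal patches can be illustrated with a toy comparison. This is a minimal sketch, not the paper's AKP layer: the abstract does not say how K is chosen, so a fixed k is assumed here, and the function names and patch scores are hypothetical.

```python
from typing import List


def max_pool(scores: List[float]) -> float:
    """Image-level score taken from the single highest-scoring patch."""
    return max(scores)


def top_k_mean_pool(scores: List[float], k: int) -> float:
    """Image-level score averaged over the k highest-scoring patches.

    Assumption: k is fixed by the caller. An adaptive rule for choosing k
    is not described in the abstract and is not modeled here.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of patches")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k


# Hypothetical per-patch scores for one class; the last patch is an outlier.
patch_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 9.0]

# Max pooling is determined entirely by the outlier patch.
print(max_pool(patch_scores))
# Averaging the top 4 patches dilutes the outlier's influence.
print(top_k_mean_pool(patch_scores, 4))
```

With max pooling, the image-level score depends on one patch alone, so a single abnormal patch sets the prediction. Pooling over several patches reduces that dependence, which is the general motivation for replacing a max pooling selection.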
