CV · Apr 3, 2023

WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

arXiv:2304.01184v2 · 47 citations · h-index: 73 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses semantic segmentation under weak supervision. It is aimed at computer vision researchers and offers an incremental improvement over existing methods.

The paper tackles weakly-supervised semantic segmentation by exploring plain Vision Transformers. It proposes a weight-based method that fuses attention heads into better class activation maps, plus a gradient clipping decoder for retraining. It reports state-of-the-art results of 78.4% mIoU on the PASCAL VOC 2012 val set and 50.3% mIoU on the COCO 2014 val set.

This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is of critical importance for understanding a classification network and launching WSSS. We observe that different attention heads of ViT focus on different image areas. Thus a novel weight-based method is proposed to end-to-end estimate the importance of attention heads, while the self-attention maps are adaptively fused for high-quality CAM results that tend to have more complete objects. Besides, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain Transformer-based Weakly-supervised learning framework WeakTr. It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014. Code is available at https://github.com/hustvl/WeakTr.
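The idea of weighting attention heads before fusing them can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that). The function names `softmax` and `fuse_heads`, the use of a softmax to normalise head scores, and the min-max rescaling of the fused map are all assumptions made for illustration. In the paper the head importance is estimated end-to-end, whereas here the scores are simply passed in as fixed numbers.

```python
import math

def softmax(scores):
    """Turn arbitrary head scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_heads(head_maps, head_scores):
    """Fuse per-head attention maps into one map by a weighted sum.

    head_maps:   list of 2D maps (one per head), all the same size.
    head_scores: one hypothetical importance score per head
                 (fixed here; learned end-to-end in the paper).
    Returns the fused map rescaled to [0, 1] (an illustrative choice).
    """
    weights = softmax(head_scores)
    rows, cols = len(head_maps[0]), len(head_maps[0][0])
    fused = [
        [sum(w * m[r][c] for w, m in zip(weights, head_maps)) for c in range(cols)]
        for r in range(rows)
    ]
    lo = min(min(row) for row in fused)
    hi = max(max(row) for row in fused)
    if hi == lo:  # constant map: nothing to rescale
        return [[0.0] * cols for _ in range(rows)]
    return [[(v - lo) / (hi - lo) for v in row] for row in fused]

if __name__ == "__main__":
    # Two toy 2x2 "heads": one focused on the diagonal, one uniform.
    maps = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
    print(fuse_heads(maps, [0.0, 0.0]))
```

With equal scores every head contributes equally. Raising one head's score shifts the fused map toward that head's focus, which is the intuition behind letting the network learn which heads matter.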

Code Implementations: 1 repo