Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
This work addresses the problem of reducing labeling costs for semantic segmentation in computer vision, but it is incremental as it primarily reviews existing methods and investigates known models.
The paper surveys traditional methods for weakly-supervised semantic segmentation using image-level labels and explores the application of visual foundation models like SAM in text prompting and zero-shot learning scenarios, highlighting their potential and challenges.
The rapid development of deep learning has driven significant progress in image semantic segmentation, a fundamental task in computer vision. Semantic segmentation algorithms often depend on pixel-level labels (i.e., object masks), which are expensive, time-consuming, and labor-intensive to obtain. Weakly-supervised semantic segmentation (WSSS) is an effective way to avoid such labeling: it uses only partial or incomplete annotations and provides a cost-effective alternative to fully-supervised semantic segmentation. In this paper, we focus on WSSS with image-level labels, which is the most challenging form of WSSS. Our work has two parts. First, we conduct a comprehensive survey of traditional methods, primarily focusing on those presented at premier research conferences. We categorize them into four groups based on the level at which they operate: pixel-wise, image-wise, cross-image, and external data. Second, we investigate the applicability of visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. We scrutinize SAM in two intriguing scenarios: text prompting and zero-shot learning. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this exciting research area.