Single-Stage Semantic Segmentation from Image Labels
This work addresses the need among computer vision researchers for simpler and more efficient semantic segmentation methods; the contribution is incremental, since it builds on earlier single-stage ideas.
The authors simplify weakly supervised semantic segmentation by developing a single-stage method trained on image-level labels, achieving results competitive with more complex multi-stage approaches.
Recent years have seen rapid growth in new approaches that improve the accuracy of semantic segmentation in a weakly supervised setting, i.e., with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage (training one segmentation network on image labels), which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.
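To make the single-stage setting concrete, the sketch below shows one generic way a segmentation network's per-pixel class scores can be supervised with image-level labels alone: pool the scores over the image and apply a multi-label classification loss. This is an illustrative assumption, not the pooling or loss used in this work; the function names (`image_scores`, `multilabel_bce`), the choice of global average pooling, and the toy values are all hypothetical.

```python
import math


def image_scores(pixel_scores):
    """Global average pooling (an assumed choice): mean raw score per class.

    pixel_scores: list over classes, each an H x W nested list of raw scores.
    """
    pooled = []
    for class_map in pixel_scores:
        values = [v for row in class_map for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(scores, labels):
    """Mean binary cross-entropy between pooled scores and 0/1 image labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


# Toy input: 2 classes on a 2x2 image; class 0 is present, class 1 is absent.
pixel_scores = [
    [[4.0, 2.0], [3.0, 3.0]],      # class 0 score map
    [[-3.0, -2.0], [-4.0, -3.0]],  # class 1 score map
]
labels = [1, 0]

pooled = image_scores(pixel_scores)
loss = multilabel_bce(pooled, labels)
print(pooled, round(loss, 4))
```

Because the loss only sees pooled scores, nothing in it constrains where a class appears in the image; a bare classification objective of this kind therefore says little about mask quality on its own, which is the gap the three properties named in the abstract are meant to address.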