Diverse Sampling for Self-Supervised Learning of Semantic Segmentation
This paper addresses the problem of reducing annotation effort for semantic segmentation, though its contribution is incremental, since it builds on existing self-supervision ideas.
The paper tackles learning semantic segmentation from image-level classification tags by using localization cues from classification networks to automatically label a sparse, diverse set of points, achieving competitive results on the VOC 2012 benchmark with training in under 3 minutes.
We propose an approach for learning category-level semantic segmentation purely from image-level classification tags indicating the presence of categories. It exploits localization cues that emerge from training classification-tasked convolutional networks to drive a "self-supervision" process that automatically labels a sparse, diverse training set of points likely to belong to classes of interest. Our approach has almost no hyperparameters, is modular, and allows for very fast training of segmentation in less than 3 minutes. It obtains competitive results on the VOC 2012 segmentation benchmark. More significantly, the modularity and fast training of our framework allow new classes to be added efficiently for inference.
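To make the idea of labeling a sparse, diverse set of points concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a per-class localization map is already available as a 2D grid of scores in [0, 1], and it uses greedy farthest-point selection as a stand-in for the diversity criterion. The function name `sample_diverse_points`, the `threshold` parameter, and the toy score map are all hypothetical.

```python
import math


def sample_diverse_points(score_map, k, threshold=0.5):
    """Pick up to k confident, mutually distant pixels for one class.

    score_map: 2D list of floats in [0, 1]; a hypothetical localization
               map derived from a classification network.
    k:         maximum number of points to label.
    threshold: hypothetical confidence cutoff (an assumption of this sketch).
    Returns a list of (row, col) tuples.
    """
    # Candidates are pixels whose localization score clears the cutoff.
    candidates = [
        (r, c)
        for r, row in enumerate(score_map)
        for c, score in enumerate(row)
        if score >= threshold
    ]
    if not candidates or k <= 0:
        return []

    # Seed with the single most confident pixel.
    first = max(candidates, key=lambda p: score_map[p[0]][p[1]])
    chosen = [first]

    # Track each candidate's distance to its nearest chosen point.
    nearest = {p: math.dist(p, first) for p in candidates}

    # Greedy farthest-point selection: repeatedly add the candidate that is
    # farthest from everything chosen so far, which spreads points out.
    while len(chosen) < min(k, len(candidates)):
        nxt = max(candidates, key=lambda p: nearest[p])
        chosen.append(nxt)
        for p in candidates:
            nearest[p] = min(nearest[p], math.dist(p, nxt))
    return chosen


# Toy map: a confident blob in the top-left and one confident pixel
# in the bottom-right corner.
toy_map = [
    [0.90, 0.80, 0.10, 0.00],
    [0.70, 0.60, 0.10, 0.00],
    [0.10, 0.10, 0.10, 0.00],
    [0.00, 0.00, 0.00, 0.95],
]
points = sample_diverse_points(toy_map, k=2)
```

On the toy map, the two selected points fall in opposite corners rather than both landing inside the top-left blob, which is the behavior a diversity criterion is meant to encourage. The selected points would then serve as automatically generated labels for training a segmentation model.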