Inferring the Class Conditional Response Map for Weakly Supervised Semantic Segmentation
This work addresses the inefficiency of multi-training pipelines in weakly supervised semantic segmentation by generating better pseudo labels without re-training the classifier.
The paper tackles the problem of incomplete pseudo labels in weakly supervised semantic segmentation. It proposes a class-conditional inference strategy and an activation-aware mask refinement loss that produce more complete response maps without re-training the classifier, and it reports superior results on benchmarks such as PASCAL VOC 2012, with 73.2% mIoU.
Image-level weakly supervised semantic segmentation (WSSS) relies on class activation maps (CAMs) for pseudo-label generation. Because CAMs highlight only the most discriminative regions of objects, the resulting pseudo labels are usually too incomplete to serve directly as supervision. Most existing approaches therefore follow a multi-training pipeline to refine CAMs into better pseudo labels: 1) re-training the classification model to generate CAMs; 2) post-processing the CAMs to obtain pseudo labels; and 3) training a semantic segmentation model with the obtained pseudo labels. However, this multi-training pipeline requires complicated tuning and additional training time. To address this, we propose a class-conditional inference strategy and an activation-aware mask refinement loss function that generate better pseudo labels without re-training the classifier. The class-conditional inference strategy is applied at inference time to reveal, separately for each class and iteratively, the classification network's hidden object activations, yielding more complete response maps. In addition, the activation-aware mask refinement loss introduces a novel way to exploit saliency maps during segmentation training, refining the foreground object masks without suppressing background objects. Our method achieves superior WSSS results without requiring re-training of the classifier.
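For readers unfamiliar with the baseline, the following is a minimal sketch of how a standard CAM can be computed and thresholded into a pseudo label. It illustrates only the conventional CAM starting point (a weighted sum of the final convolutional feature maps using the classifier weights), not the class-conditional inference strategy or the refinement loss proposed here. The function names, array shapes, and the `bg_threshold` value of 0.3 are assumptions chosen for illustration.

```python
import numpy as np

def class_activation_maps(features, fc_weights):
    """Standard CAM: weight the final conv feature maps by the classifier weights.

    features:   (K, H, W) feature maps from the last conv layer (assumed shape)
    fc_weights: (C, K) weights of the linear classifier after global pooling
    Returns (C, H, W) maps, rectified and scaled per class to roughly [0, 1].
    """
    cams = np.einsum("ck,khw->chw", fc_weights, features)
    cams = np.maximum(cams, 0.0)  # keep positive evidence only
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    return cams / (peak[:, None, None] + 1e-8)

def cams_to_pseudo_label(cams, present_classes, bg_threshold=0.3):
    """Turn CAMs into a label map: 0 is background, c + 1 is class c.

    Only classes listed in the image-level label (present_classes) compete;
    bg_threshold is a hypothetical constant background score.
    """
    num_classes, height, width = cams.shape
    scores = np.zeros((num_classes + 1, height, width))
    scores[0] = bg_threshold
    for c in present_classes:
        scores[c + 1] = cams[c]
    return scores.argmax(axis=0)

# Toy example: two feature channels, two classes, a 2x2 spatial grid.
features = np.array([[[1.0, 0.0], [0.0, 0.0]],
                     [[0.0, 0.0], [0.0, 1.0]]])
fc_weights = np.eye(2)
cams = class_activation_maps(features, fc_weights)
label = cams_to_pseudo_label(cams, present_classes=[0, 1])
```

In this toy case each class fires at a single pixel and everything else falls to background, which mirrors the limitation noted above: plain CAMs cover only the most discriminative regions, so the resulting pseudo labels are incomplete.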