CV · Mar 21, 2023

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen

arXiv:2303.11681v435.7199 citations · h-index: 126 · Has Code

Originality: Incremental advance

AI Analysis

The method reduces data collection and annotation costs for semantic segmentation tasks, though the advance is incremental because it builds on existing diffusion models.

The paper tackles the cost of pixel-level annotation for semantic segmentation by proposing DiffuMask, which automatically generates semantic masks for synthetic images from Stable Diffusion's cross-attention maps. Segmentation models trained on this synthetic data achieve competitive performance on VOC 2012 and Cityscapes, with some classes (e.g., bird) within a 3% mIoU gap of real-data results, and the method sets a new SOTA on unseen classes of VOC 2012 in open-vocabulary segmentation.

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be obtained freely from a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. The method clearly helps reduce data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data can achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art result of real data (within 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at https://weijiawu.github.io/DiffusionMask/.
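As a rough illustration of the core idea in the abstract (turning per-word cross-attention maps into a pixel mask), the sketch below averages attention maps of different resolutions for one class token and binarizes the result. It is not the authors' implementation: the function names, the nearest-neighbour upsampling, the min-max normalization, and the fixed 0.5 threshold are all assumptions made for illustration, and the "practical techniques" the abstract mentions for refining the mask are not reproduced. Plain Python lists are used so the example has no dependencies.

```python
from typing import List

Map = List[List[float]]  # a square 2D attention map for one text token


def upsample_nearest(attn: Map, size: int) -> Map:
    """Nearest-neighbour upsample a square map to size x size."""
    src = len(attn)
    if size % src != 0:
        raise ValueError("target size must be a multiple of the map size")
    scale = size // src
    return [[attn[r // scale][c // scale] for c in range(size)]
            for r in range(size)]


def attention_to_mask(maps: List[Map], size: int,
                      threshold: float = 0.5) -> List[List[int]]:
    """Average multi-resolution maps for one token, then binarize.

    The fixed threshold is an assumption of this sketch, not a value
    taken from the paper.
    """
    if not maps:
        raise ValueError("need at least one attention map")
    resized = [upsample_nearest(m, size) for m in maps]
    avg = [[sum(m[r][c] for m in resized) / len(resized)
            for c in range(size)] for r in range(size)]
    lo = min(min(row) for row in avg)
    hi = max(max(row) for row in avg)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    # Min-max normalize to [0, 1], then mark pixels at or above the threshold.
    return [[1 if (v - lo) / span >= threshold else 0 for v in row]
            for row in avg]


if __name__ == "__main__":
    coarse = [[0.1, 0.9], [0.1, 0.9]]                # hypothetical 2x2 map
    fine = [[0.0, 0.2, 0.8, 1.0] for _ in range(4)]  # hypothetical 4x4 map
    for row in attention_to_mask([coarse, fine], size=4):
        print(row)
```

In this toy input both maps respond on the right half of the image, so the right two columns end up in the mask. With a real diffusion model, the maps would come from the cross-attention layers for the class word in the prompt.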

