Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
This work addresses semantic grounding for computer vision researchers and practitioners by enabling zero-shot segmentation without additional training, though its contribution is incremental in that it builds on existing diffusion models.
The authors tackled the problem of semantic localization using text-to-image diffusion models without segmentation-specific training, achieving competitive results on Pascal VOC and RefCOCO datasets for zero-shot, open-vocabulary segmentation.
Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases without segmentation-specific re-training. We introduce an inference-time optimization process capable of generating segmentation masks conditioned on natural language prompts. Our proposal, Peekaboo, is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique leveraging diffusion models without any training. We evaluate Peekaboo on the Pascal VOC dataset for unsupervised semantic segmentation and the RefCOCO dataset for referring segmentation, showing promising and competitive results. We also demonstrate how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was trained only on RGB images; to our knowledge, we are the first to attempt this. Please see our project page, which includes our code: https://ryanndagreat.github.io/peekaboo