Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models
This work addresses the difficulty of curating labeled datasets for object segmentation, with particular benefit for applications in computer vision and medical image analysis; the contribution is incremental in that it builds on existing generative models.
The paper tackles the problem of generating foreground-background segmentation models without segmentation labels. It uses pre-trained latent diffusion models to create weak masks from textual descriptions, then fine-tunes the diffusion model on an inpainting task to produce synthetic datasets. It demonstrates improved performance over previous methods and closes the gap with fully supervised training on tasks such as segmenting humans, dogs, cars, and birds.
Curating datasets for object segmentation is a difficult task. With the advent of large-scale pre-trained generative models, conditional image generation has improved significantly in both result quality and ease of use. In this paper, we present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions, without requiring segmentation labels. We leverage and explore pre-trained latent diffusion models to automatically generate weak segmentation masks for concepts and objects. The masks are then used to fine-tune the diffusion model on an inpainting task, which enables fine-grained removal of the object while at the same time providing a synthetic foreground and background dataset. We demonstrate that this method outperforms previous methods in both discriminative and generative performance and closes the gap with fully supervised training while requiring no pixel-wise object labels. We show results on the task of segmenting four different objects (humans, dogs, cars, birds) and a use case scenario in medical image analysis. The code is available at https://github.com/MischaD/fobadiffusion.
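
To make the weak-mask step more concrete, the following is a minimal sketch and not the implementation from the linked repository. It assumes that the diffusion model yields a per-pixel relevance heatmap for the prompted concept; how that heatmap is obtained is not specified here. The heatmap is min-max normalized and thresholded into a binary mask, the masked region is blanked to form the kind of input an inpainting model would fill, and the mask is scored against a reference with intersection over union. The function names (`normalize`, `weak_mask`, `mask_out`, `iou`) and the threshold of 0.5 are hypothetical choices for illustration, and plain Python lists are used only to keep the sketch free of dependencies.

```python
from typing import List

Grid = List[List[float]]  # 2D map of float values (heatmap or grayscale image)
Mask = List[List[int]]    # 2D binary mask: 1 = foreground, 0 = background


def normalize(heatmap: Grid) -> Grid:
    """Min-max rescale a heatmap to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


def weak_mask(heatmap: Grid, threshold: float = 0.5) -> Mask:
    """Threshold the normalized heatmap (assumed threshold, not from the paper)."""
    return [[1 if v >= threshold else 0 for v in row] for row in normalize(heatmap)]


def mask_out(image: Grid, mask: Mask, fill: float = 0.0) -> Grid:
    """Blank the foreground pixels, giving the input an inpainting model would fill."""
    return [
        [fill if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks; two empty masks score 1.0."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0


# Toy 3x3 heatmap with a bright lower-right region standing in for the object.
heatmap = [
    [0.0, 0.2, 0.1],
    [0.1, 0.9, 0.8],
    [0.0, 0.7, 1.0],
]
image = [[1.0, 1.0, 1.0] for _ in range(3)]
reference = [[0, 0, 0], [0, 1, 1], [0, 0, 1]]

mask = weak_mask(heatmap)
background_input = mask_out(image, mask)
score = iou(mask, reference)
```

In a full pipeline, `background_input` together with `mask` would be handed to the fine-tuned inpainting model to synthesize an object-free background, and masks such as `mask` would serve as weak labels for training a segmentation network; both of those stages are outside the scope of this sketch.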