G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models
This work addresses segmentation refinement in computer vision and represents an incremental advance: it applies generative models to a discriminative task.
The paper tackles the Inexact Segmentation (IS) problem in three steps. It uses a text-to-image diffusion model to generate images conditioned on a coarse mask. It then exploits the pattern discrepancies between these images and the originals to refine the mask. Comprehensive experiments show that this outperforms prior approaches.
This paper considers how to use a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Traditional approaches rely heavily on discriminative-model-based paradigms or on dense visual representations derived from internal attention mechanisms. Our method instead focuses on the intrinsic generative priors in Stable Diffusion (SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to drive coarse-to-fine segmentation refinement. We do this by establishing semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design. The results underscore the potential of using generation discrepancies to model dense representations and encourage further exploration of generative approaches to discriminative tasks.
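The abstract describes the refinement only at a high level. The following is a minimal NumPy sketch under stated assumptions, not the authors' G4Seg implementation. It assumes per-pixel features for the original image and for the mask-conditional generated image have already been extracted, and it leaves the backbone unspecified. The names `refine_foreground` and `alpha` are hypothetical. The sketch also swaps in three simple stand-ins:

- Semantic correspondence is approximated by cosine nearest-neighbour matching.
- The pattern discrepancy is approximated by the per-location cosine distance.
- The foreground probability is pulled toward the value transferred from the matched generated pixel, gated by that discrepancy.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def refine_foreground(feat_orig, feat_gen, prob, alpha=0.5):
    """Hypothetical discrepancy-gated refinement step (not the paper's exact rule).

    feat_orig, feat_gen: (N, C) per-pixel features of the original image and of the
        image generated conditioned on the coarse mask (assumed same resolution).
    prob: (N,) coarse foreground probability, i.e. the mask used as the condition.
    alpha: hypothetical step size in [0, 1].
    """
    fo = l2_normalize(np.asarray(feat_orig, dtype=np.float64))
    fg = l2_normalize(np.asarray(feat_gen, dtype=np.float64))
    prob = np.asarray(prob, dtype=np.float64)

    # Pattern discrepancy at each location: cosine distance, clipped to [0, 1].
    discrepancy = np.clip(1.0 - np.sum(fo * fg, axis=1), 0.0, 1.0)

    # Semantic correspondence: nearest generated pixel for each original pixel.
    sim = fo @ fg.T  # (N, N) cosine similarities
    match = sim.argmax(axis=1)
    confidence = np.clip(sim[np.arange(len(match)), match], 0.0, 1.0)

    # The generated image follows the coarse mask, so the matched pixel's
    # conditioning probability is transferred back to the original pixel.
    transferred = prob[match]

    # Move probability toward the transferred value only where the aligned
    # patterns disagree and the correspondence is confident.
    updated = prob + alpha * discrepancy * confidence * (transferred - prob)
    return np.clip(updated, 0.0, 1.0), match, discrepancy


if __name__ == "__main__":
    # Toy 4-pixel example: pixels 0-1 are truly foreground, but the coarse mask
    # marks only pixel 0, so the generated image shows background at pixel 1.
    feat_orig = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    feat_gen = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
    coarse = np.array([1.0, 0.0, 0.0, 0.0])
    refined, match, disc = refine_foreground(feat_orig, feat_gen, coarse)
    print(refined)  # pixel 1 rises from 0.0 to 0.5; the others are unchanged
```

The dense N×N similarity matrix keeps the sketch short but costs O(N²) memory. A practical variant would work on downsampled feature maps or restrict matching to a local window.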