Masked Diffusion as Self-supervised Representation Learner
This work addresses the need for scalable self-supervised representation learning for semantic segmentation of medical and natural images, though the contribution appears incremental because it modifies existing diffusion models rather than introducing a new framework.
The paper decomposes the interrelation between generative capability and representation learning in diffusion models. It proposes a masked diffusion model (MDM) as a self-supervised representation learner and reports that MDM surpasses prior benchmarks on semantic segmentation tasks, particularly in few-shot scenarios.
Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present the masked diffusion model (MDM), a scalable self-supervised representation learner for semantic segmentation, substituting the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly in few-shot scenarios.
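The substitution described in the abstract, masking in place of additive Gaussian noise, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the non-overlapping square patches, the zero fill value, and the `mask_ratio` and `patch` parameters are all assumptions made for illustration, since the abstract does not specify the masking scheme.

```python
import numpy as np

def gaussian_corrupt(x, alpha_bar, rng):
    """DDPM-style forward corruption: scale the image and add Gaussian noise."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise

def mask_corrupt(x, mask_ratio, patch, rng):
    """Hypothetical masking corruption: zero out a random subset of patches.

    x is an (H, W, C) image; H and W must be divisible by `patch`.
    Returns the corrupted image and a boolean (H, W) map of kept pixels.
    """
    h, w = x.shape[:2]
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide (assumption: uniform random choice).
    keep = np.ones(n_patches, dtype=bool)
    keep[rng.permutation(n_patches)[:n_masked]] = False

    # Expand the patch-level decision to a pixel-level map.
    pixel_keep = np.repeat(np.repeat(keep.reshape(gh, gw), patch, axis=0),
                           patch, axis=1)

    out = x.copy()
    out[~pixel_keep] = 0.0  # assumption: masked pixels are set to zero
    return out, pixel_keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))
    noisy = gaussian_corrupt(image, alpha_bar=0.5, rng=rng)
    masked, kept = mask_corrupt(image, mask_ratio=0.5, patch=4, rng=rng)
    print(noisy.shape, masked.shape, int(kept.sum()))
```

In this sketch, an 8x8 image with 4x4 patches and a mask ratio of 0.5 always hides exactly two of the four patches, leaving 32 pixels visible. A self-supervised learner in this setting would be trained to reconstruct the hidden regions from the visible ones, in place of predicting the added noise.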