USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
This work addresses a specific bottleneck in integrating self-supervised vision models with diffusion models for image generation and understanding, offering incremental advancements in efficiency and quality.
The paper tackles the difficulty of transferring pretrained weights from vision models to diffusion models, which stems from input mismatches and the use of latent spaces. It proposes Unified Self-supervised Pretraining (USP), which initializes diffusion models via masked latent modeling in a VAE latent space. USP achieves comparable performance on understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at https://github.com/AMAP-ML/USP.
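To make the phrase "masked latent modeling" concrete, the following is a minimal sketch of the general idea, not the USP implementation. It assumes an MAE-style recipe: a latent produced by a frozen VAE encoder is split into patches, a random subset is masked, and a reconstruction loss is computed on the masked patches only. The patch size, the 0.75 mask ratio, the mean-squared-error loss, the toy random latent, and every function name (`patchify`, `random_mask`, `masked_mse`) are illustrative assumptions; the abstract does not specify these details.

```python
import random


def patchify(latent, patch):
    """Split a C x H x W latent (nested lists) into flattened, non-overlapping patches."""
    channels, height, width = len(latent), len(latent[0]), len(latent[0][0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ])
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Return (visible_indices, masked_indices) for one sample."""
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(predicted, target, masked):
    """Mean squared error averaged over the masked patches only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy 4-channel 8x8 latent standing in for the output of a frozen VAE encoder (assumption).
rng = random.Random(0)
latent = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)] for _ in range(4)]

patches = patchify(latent, patch=2)  # 16 patches of 16 values each
visible, masked = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Placeholder "model" that predicts zeros for every patch. A real encoder would
# see only the visible patches and a decoder would predict the masked ones.
predicted = [[0.0] * len(p) for p in patches]
loss = masked_mse(predicted, patches, masked)
```

In this reading, the connection to the input mismatch mentioned above is that pretraining operates on VAE latents, the same kind of input a latent diffusion model consumes, so the pretrained weights could be reused for initialization without changing the input domain.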