Latent Diffusion Models with Masked AutoEncoders
This work targets a specific bottleneck in diffusion-based image generation: the design of the autoencoder used by Latent Diffusion Models (LDMs). It is an incremental improvement that combines masked autoencoders with LDMs, and it is relevant to both researchers and practitioners.
The authors tackle the problem of suboptimal autoencoder design in Latent Diffusion Models (LDMs) for image generation. They identify three key properties (latent smoothness, perceptual compression quality, and reconstruction quality) that existing autoencoders fail to satisfy simultaneously. To address this, they propose Variational Masked AutoEncoders (VMAEs) and integrate them into LDMs to form LDMAEs, which they report improve image generation performance.
In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoders. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Our code is available at https://github.com/isno0907/ldmae.
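To make the two ingredients named above more concrete, the following is a minimal sketch of MAE-style random patch masking combined with a variational bottleneck (reparameterized sampling plus a KL term). It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.75 mask ratio, the patch count, and the latent dimension are all hypothetical, and random numbers stand in for real encoder outputs.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over patches."""
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return kl.sum(axis=-1).mean()

rng = np.random.default_rng(0)
num_patches, latent_dim = 16, 4  # hypothetical sizes

# Hide a random subset of patches; only the visible ones would be encoded.
mask = random_patch_mask(num_patches, mask_ratio=0.75, rng=rng)
visible = ~mask

# Stand-in for encoder outputs on the visible patches (assumed values,
# not produced by any real encoder).
mu = rng.standard_normal((int(visible.sum()), latent_dim))
logvar = np.zeros_like(mu)

z = reparameterize(mu, logvar, rng)      # stochastic latents
kl = kl_to_standard_normal(mu, logvar)   # regularizer pulling latents toward N(0, I)
```

In a sketch like this, the KL term is the part usually associated with a smoother latent space, while the masking objective is the part associated with learning features at several levels of abstraction. How the paper actually balances these terms against reconstruction quality is not stated in the abstract.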