CV · Nov 6, 2024

Boosting Latent Diffusion with Perceptual Objectives

arXiv:2411.04873v2 · 17 citations · h-index: 35 · ICLR
Originality: Incremental advance
AI Analysis

This work addresses image quality issues in state-of-the-art generative models used for tasks such as image synthesis, although it is an incremental improvement over existing latent diffusion models.

The paper tackles the loss of detail in latent diffusion models (LDMs) by proposing a latent perceptual loss (LPL) built on the internal features of the autoencoder's decoder. The result is sharper, more realistic images, with FID improvements of 6% to 20% across three datasets at 256 and 512 resolution.
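The headline figures are percentages. Reading them as relative FID reductions is an assumption here (lower FID is better, and the text does not spell out the formula); under that reading the arithmetic is a one-liner. The function name and the FID values in the example are hypothetical, not results from the paper.

```python
def relative_fid_improvement(fid_baseline: float, fid_with_lpl: float) -> float:
    """Relative FID reduction in percent (hypothetical helper; lower FID is better)."""
    if fid_baseline <= 0:
        raise ValueError("baseline FID must be positive")
    return 100.0 * (fid_baseline - fid_with_lpl) / fid_baseline


# Hypothetical numbers for illustration only, not taken from the paper.
print(relative_fid_improvement(10.0, 8.0))  # 20.0
```

Under this reading, a 6% improvement on a hypothetical baseline FID of 10.0 would correspond to an FID of about 9.4.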

Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remedy this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results, with FID improvements between 6% and 20%, as well as improved qualitative results when using our perceptual loss.
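To make the idea concrete, here is a minimal, dependency-free Python sketch of a decoder-feature loss computed from latents. Everything in it is an illustrative assumption rather than the authors' implementation: the "decoder" is a hypothetical stack of random dense layers with ReLU standing in for the real AE decoder blocks, `latent_perceptual_loss` simply averages per-layer mean squared errors, and `x0_from_prediction` uses one common set of schedule conventions to turn an epsilon, velocity, or flow-matching output into a predicted clean latent. The paper's actual layer selection, feature normalization, and weighting are not reproduced here.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]  # one row per output unit


def x0_from_prediction(kind: str, x_t: Vector, pred: Vector, t: float) -> Vector:
    """Turn a network output into a predicted clean latent x0_hat.

    Assumed conventions (illustrative, not taken from the paper):
      "eps":  x_t = alpha*x0 + sigma*eps, with alpha = cos(pi*t/2), sigma = sin(pi*t/2)
      "v":    same schedule, target v = alpha*eps - sigma*x0
      "flow": x_t = (1 - t)*x0 + t*eps, target velocity u = eps - x0
    """
    if kind in ("eps", "v"):
        alpha, sigma = math.cos(math.pi * t / 2), math.sin(math.pi * t / 2)
        if kind == "eps":
            # Invert x_t = alpha*x0 + sigma*eps (requires alpha > 0, i.e. t < 1).
            return [(x - sigma * e) / alpha for x, e in zip(x_t, pred)]
        # alpha*x_t - sigma*v = (alpha^2 + sigma^2)*x0 = x0.
        return [alpha * x - sigma * v for x, v in zip(x_t, pred)]
    if kind == "flow":
        # x_t - t*u = (1 - t)*x0 + t*eps - t*(eps - x0) = x0.
        return [x - t * u for x, u in zip(x_t, pred)]
    raise ValueError(f"unknown parameterization: {kind}")


def make_toy_decoder(dims: Sequence[int], seed: int = 0) -> List[Matrix]:
    """Hypothetical stand-in for the AE decoder: random dense layers of the given widths."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(d_in)) for _ in range(d_in)] for _ in range(d_out)]
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def decoder_features(layers: List[Matrix], z: Vector) -> List[Vector]:
    """Apply each layer (matrix-vector product + ReLU) and keep every intermediate activation."""
    feats: List[Vector] = []
    h = z
    for weight in layers:
        h = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in weight]
        feats.append(h)
    return feats


def latent_perceptual_loss(layers: List[Matrix], z_pred: Vector, z_true: Vector) -> float:
    """Mean over layers of the per-layer mean squared difference of decoder features."""
    f_pred = decoder_features(layers, z_pred)
    f_true = decoder_features(layers, z_true)
    per_layer = [
        sum((a - b) ** 2 for a, b in zip(fp, ft)) / len(fp)
        for fp, ft in zip(f_pred, f_true)
    ]
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    rng = random.Random(1)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
    eps = [rng.gauss(0.0, 1.0) for _ in range(4)]
    t = 0.3
    decoder = make_toy_decoder([4, 8, 8])

    # Flow matching with a perfect velocity prediction: x0_hat matches x0.
    x_t = [(1 - t) * a + t * e for a, e in zip(x0, eps)]
    u = [e - a for a, e in zip(x0, eps)]
    x0_hat = x0_from_prediction("flow", x_t, u, t)
    print(latent_perceptual_loss(decoder, x0_hat, x0))  # close to 0.0

    # A shifted prediction typically gives a larger loss.
    shifted = [a + 0.5 for a in x0_hat]
    print(latent_perceptual_loss(decoder, shifted, x0))
```

The point the sketch tries to show is that the loss only needs a clean-latent estimate, so switching between epsilon prediction, velocity prediction, and flow matching changes the recovery step but not the loss itself. In training, such a term would presumably be added to the standard diffusion objective with a weighting factor; that weighting is not specified here.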

