CV | AI | Nov 5, 2024

On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

arXiv:2411.03177v2 | 16 citations | h-index: 35 | NIPS
Originality: Incremental advance
AI Analysis

This work addresses reproducibility and performance gaps in diffusion models for the research community, offering incremental improvements in conditioning and efficiency.

The paper tackles inconsistent training recipes in latent diffusion models. It re-implements five previously published models for fair comparison and studies conditioning mechanisms and pre-training strategies. The resulting conditioning mechanism improves FID by 7-8% on ImageNet-1k (class-conditional) and by 8-23% on CC12M (text-to-image), depending on resolution.

Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best-performing LDM training recipes are often not available to the research community, preventing apples-to-apples comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes focusing on the performance of models and their training efficiency. To ensure apples-to-apples comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i) the mechanisms used to condition the generative model on semantic information (e.g., text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on the model performance, and (ii) the transfer of the representations learned on smaller and lower-resolution datasets to larger ones on the training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditionings and sets a new state of the art in class-conditional generation on the ImageNet-1k dataset, with FID improvements of 7% at 256 and 8% at 512 resolution, as well as in text-to-image generation on the CC12M dataset, with FID improvements of 8% at 256 and 23% at 512 resolution.
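
The percentages quoted above are most naturally read as relative reductions in FID, where a lower FID is better. The abstract does not spell out the formula, so the following is a minimal sketch under that assumption. The function name `relative_fid_improvement` and the FID values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_fid_improvement(baseline_fid: float, new_fid: float) -> float:
    """Return the relative FID reduction in percent.

    A positive value means the new model has a lower (better) FID
    than the baseline.
    """
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return (baseline_fid - new_fid) / baseline_fid * 100.0


# Hypothetical FID values, used only to show the calculation.
baseline_fid = 10.0
new_fid = 9.3
improvement = relative_fid_improvement(baseline_fid, new_fid)
print(f"{improvement:.1f}% relative improvement")
```

Under this reading, a fixed relative improvement corresponds to a smaller absolute FID change when the baseline FID is already low, so the 7% and 23% figures are not directly comparable as absolute gaps across datasets and resolutions.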

