CV / AI / LG, May 7, 2024

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

arXiv:2405.03958v3 | 11 citations | h-index: 5 | Trans. Mach. Learn. Res.
Originality: Synthesis-oriented
AI Analysis

This is an incremental improvement for diffusion model practitioners: it adds conditioning to the attention layers without changing or tuning the rest of the U-Net.

The paper addresses suboptimal conditioning in diffusion models by adding LoRA conditioning to the attention layers. On CIFAR-10 with the EDM diffusion model, this improves FID from the baseline 1.97/1.79 to 1.91/1.75 (unconditional/class-conditional generation).

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79.
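To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the two kinds of conditioning. It is an illustration under stated assumptions, not the paper's implementation: the function names, the shapes, the `(1 + scale)` form of scale-and-shift, and the choice to mix several low-rank adapters with weights derived from the condition embedding (the hypothetical matrix `P`) are all assumptions made for this example.

```python
import numpy as np

def scale_shift(h, scale, shift):
    """Scale-and-shift conditioning of a feature map h (assumed FiLM-style form)."""
    return h * (1.0 + scale) + shift

def lora_conditioned_projection(x, W, A, B, weights):
    """Apply a q/k/v-style projection with a condition-dependent low-rank update.

    x: (n, d_in) tokens, W: (d_in, d_out) frozen base weight,
    A: (m, d_in, r) and B: (m, r, d_out) hold m rank-r adapters,
    weights: (m,) mixing coefficients computed from the condition.
    Equivalent to x @ (W + sum_i weights[i] * A[i] @ B[i]).
    """
    base = x @ W
    low = np.einsum("nd,mdr->mnr", x, A)              # project down to rank r
    delta = np.einsum("mnr,mro,m->no", low, B, weights)  # project up and mix
    return base + delta

rng = np.random.default_rng(0)
n, d, r, m, d_emb = 4, 8, 2, 3, 5                     # hypothetical sizes
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
A = rng.normal(size=(m, d, r))
B_zero = np.zeros((m, r, d))                          # usual LoRA init: B = 0
P = rng.normal(size=(d_emb, m))                       # hypothetical embedding-to-weights map
cond = rng.normal(size=(d_emb,))                      # stand-in for a time/class embedding
w = cond @ P

q_init = lora_conditioned_projection(x, W, A, B_zero, w)
B_trained = rng.normal(size=(m, r, d))                # stand-in for trained adapters
q_cond = lora_conditioned_projection(x, W, A, B_trained, w)
```

With `B` initialised to zero, the conditioned projection reproduces the unconditioned one exactly, which is one way a "drop-in" addition can leave the rest of the network untouched at the start of training. Once `B` is nonzero, the effective attention weight depends on the condition through `w`.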

