Exploring Transformer Backbones for Image Diffusion Models
This work simplifies the architecture of image diffusion models, which may ease the fusion of text and image data, but the contribution is incremental: it shows comparable rather than superior performance.
The authors tackle image synthesis by proposing a Transformer-based Latent Diffusion model, which achieves a 14.1 FID on class-conditioned ImageNet generation, comparable to the 13.1 FID of a UNet-based architecture.
We present an end-to-end Transformer-based Latent Diffusion model for image synthesis. On the ImageNet class-conditioned generation task, we show that a Transformer-based Latent Diffusion model achieves a 14.1 FID, which is comparable to the 13.1 FID of a UNet-based architecture. In addition to showing the application of Transformer models to diffusion-based image synthesis, this simplification in architecture allows easy fusion and modeling of text and image data. The multi-head attention mechanism of Transformers enables simplified interaction between the image and text features, which removes the need for the cross-attention mechanism used in UNet-based Diffusion models.
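The claim that multi-head attention removes the need for cross-attention can be illustrated with a small sketch. This is not the authors' implementation: it assumes that image latent patches and text embeddings are projected to a shared width and concatenated into one token sequence, and every name, shape, and random weight below is hypothetical. Under that assumption, a single self-attention pass already lets image tokens attend to text tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Plain multi-head self-attention over one (n, d) token sequence."""
    n, d = tokens.shape
    head_dim = d // num_heads

    def split_heads(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    weights = softmax(scores, axis=-1)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights

# Hypothetical sizes: 16 image latent patches, 5 text tokens, width 32.
rng = np.random.default_rng(0)
d_model, num_heads = 32, 4
num_image, num_text = 16, 5
image_tokens = rng.normal(size=(num_image, d_model))
text_tokens = rng.normal(size=(num_text, d_model))

# Assumed fusion step: concatenate both modalities into one sequence.
joint = np.concatenate([image_tokens, text_tokens], axis=0)

# Random projections stand in for learned weights.
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)
out, attn = multi_head_self_attention(joint, w_q, w_k, w_v, w_o, num_heads)

# Rows for image tokens, columns for text tokens: the interaction that a
# UNet-based model would obtain from a separate cross-attention layer.
image_to_text = attn[:, :num_image, num_image:]
image_out = out[:num_image]
```

The `image_to_text` slice holds the attention weights from image queries to text keys, so text conditioning reaches the image tokens without a dedicated cross-attention module. How the actual model embeds, orders, or masks the two token types is not stated in the abstract.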