CV CL LG

Dec 16, 2024

Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan

Amazon

arXiv:2412.12391v1

6.53 citations

h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses scaling challenges in text-to-image generation for AI researchers, but it is incremental as it builds on existing DiT methods.

The study investigates scaling Diffusion Transformers for text-to-image generation, finding that a 2.3B parameter U-ViT model outperforms SDXL UNet and other variants in controlled settings, with experiments on datasets up to 600M images.

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively than cross-attention based DiT variants, which allows straightforward expansion to extra conditions and other modalities. We identify that a 2.3B U-ViT model can achieve better performance than SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and using enhanced long captions improve text-image alignment performance and learning efficiency.
