Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net
This work addresses efficiency and quality issues in multimodal data generation for AI applications, representing an incremental improvement over prior methods.
The paper tackles inefficiency and cross-modal interference in multimodal diffusion models by proposing a Partially Shared U-Net architecture and a joint data infilling sampling method, yielding higher-quality text and image generation on MS-COCO with faster training and sampling than existing models.
Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling and overlook inefficiency and interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections, preserving modality-specific fine-grained details. Inspired by image inpainting, we also propose an efficient multimodal sampling method that introduces new scenarios for conditional generation while requiring only a simple joint distribution to be learned. Our experiments on the MS-COCO dataset demonstrate that our method generates multimodal text and image data of higher quality than existing multimodal diffusion models while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.
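To make the inpainting-inspired sampling idea concrete, the following is a minimal sketch of conditional generation from a model that has only learned the joint distribution. It assumes a common replacement-style inpainting scheme on top of a standard DDPM sampler, which may differ from the paper's actual procedure. The names `linear_schedule`, `infill_sample`, and `make_point_mass_eps` are hypothetical, the joint sample is a toy flat vector standing in for concatenated image and text representations, and the noise predictor is an analytic stand-in rather than a trained PS-U-Net.

```python
import math
import random


def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for a linear DDPM noise schedule."""
    betas, alpha_bars, running = [], [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        betas.append(beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def infill_sample(eps_model, x0_known, known_mask, betas, alpha_bars, rng):
    """Sample the unknown entries of a joint vector given the known ones.

    eps_model(x_t, t) is a hypothetical joint noise predictor. known_mask[i]
    is True where x0_known[i] holds an observed (conditioning) value.
    """
    dim = len(x0_known)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(len(betas))):
        a_bar = alpha_bars[t]
        noise_std = math.sqrt(1.0 - a_bar)
        # Overwrite known entries with the observation noised to level t.
        for i in range(dim):
            if known_mask[i]:
                x[i] = math.sqrt(a_bar) * x0_known[i] + noise_std * rng.gauss(0.0, 1.0)
        eps = eps_model(x, t)
        # Standard DDPM ancestral step applied to the whole joint vector.
        coef = betas[t] / noise_std
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [scale * (x[i] - coef * eps[i]) + sigma * rng.gauss(0.0, 1.0)
             for i in range(dim)]
    # Known entries are returned exactly as observed.
    return [x0_known[i] if known_mask[i] else x[i] for i in range(dim)]


def make_point_mass_eps(target, alpha_bars):
    """Exact noise predictor when every data point equals `target` (toy only)."""
    def eps_model(x_t, t):
        a_bar = alpha_bars[t]
        return [(x - math.sqrt(a_bar) * m) / math.sqrt(1.0 - a_bar)
                for x, m in zip(x_t, target)]
    return eps_model


betas, alpha_bars = linear_schedule(50)
target = [1.0, -1.0, 0.5, 2.0]           # toy joint "image | text" vector
known_mask = [True, True, False, False]  # condition on the first modality
eps_model = make_point_mass_eps(target, alpha_bars)
sample = infill_sample(eps_model, target, known_mask, betas, alpha_bars,
                       random.Random(0))
```

Under these assumptions, flipping `known_mask` conditions on the other modality, and an all-`False` mask reduces to unconditional joint sampling, so one jointly trained model covers several generation scenarios without separate conditional training.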