IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
This work addresses multimodal alignment in diffusion-based image generation. It offers a plug-and-play solution that enhances existing methods, though the contribution is incremental because it builds on prior alignment techniques.
The paper tackles the challenge of aligning diffusion-generated images with their input prompts. It proposes IMG, a re-generation-based framework in which a multimodal large language model identifies misalignments and an Implicit Aligner corrects them without extra data or editing operations. In evaluations on SDXL, SDXL-DPO, and FLUX, the authors report that IMG outperforms existing alignment methods.
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal as a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
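The inference-time flow described in steps a) and b) can be pictured with a minimal control-flow sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `img_regenerate`, the `RegenResult` record, the four callables it takes, and the toy stand-ins below are all hypothetical names introduced here, and a single re-generation pass is assumed. The training objective from step c) is left out because the abstract does not specify its form.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in for diffusion conditioning features (real ones are tensors).
Features = List[float]


@dataclass
class RegenResult:
    initial_image: str
    misalignments: List[str]
    aligned_features: Features
    final_image: str


def img_regenerate(
    prompt: str,
    encode_prompt: Callable[[str], Features],
    generate: Callable[[Features], str],
    find_misalignments: Callable[[str, str], List[str]],
    implicit_aligner: Callable[[Features, Sequence[str]], Features],
) -> RegenResult:
    """One assumed pass: generate, detect misalignments, adjust features, re-generate."""
    features = encode_prompt(prompt)
    image = generate(features)
    # Step a): an MLLM-like component compares the image with the prompt.
    issues = find_misalignments(image, prompt)
    if not issues:
        # Nothing to correct, so the first image is kept.
        return RegenResult(image, [], features, image)
    # Step b): adjust the conditioning features, then re-generate from them.
    aligned = implicit_aligner(features, issues)
    return RegenResult(image, issues, aligned, generate(aligned))


# Toy stand-ins so the sketch runs end to end; none of these model real components.
def toy_encode(prompt: str) -> Features:
    return [float(len(word)) for word in prompt.split()]


def toy_generate(features: Features) -> str:
    return "image(%.1f)" % sum(features)


def toy_detector(image: str, prompt: str) -> List[str]:
    # Flags only the image produced from the unmodified prompt features.
    if image == toy_generate(toy_encode(prompt)):
        return ["red hat missing"]
    return []


def toy_aligner(features: Features, issues: Sequence[str]) -> Features:
    # Shifts every feature by a fixed amount per reported issue.
    return [f + 0.5 * len(issues) for f in features]


result = img_regenerate(
    "a dog in a red hat", toy_encode, toy_generate, toy_detector, toy_aligner
)
```

Passing the components as callables mirrors the plug-and-play framing in the abstract: under this assumed interface, the base generator could be swapped without touching the detection or alignment steps.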