D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples
D-Fusion addresses visual inconsistency in the fine-tuning of diffusion models for text-to-image generation, an incremental improvement relevant to AI and creative applications.
The paper tackles misalignment between generated images and text prompts in diffusion models. It introduces D-Fusion, a method that constructs visually consistent samples for direct preference optimization, and its experiments show improved prompt-image alignment.
The practical applications of diffusion models have been limited by misalignment between generated images and their corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to improve the alignment of these models. However, the effectiveness of DPO is constrained by visual inconsistency: the large visual disparity between well-aligned and poorly aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method for constructing DPO-trainable, visually consistent samples. First, mask-guided self-attention fusion produces images that are not only well aligned but also visually consistent with the given poorly aligned images. Second, D-Fusion retains the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
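To make the role of the preference pair concrete, the following is a minimal sketch of the generic DPO objective for a single (preferred, dispreferred) pair. It is not taken from the paper: the function name `dpo_loss`, its parameters, and the default `beta` are illustrative assumptions. For a diffusion model, each log-probability would presumably be accumulated over the denoising trajectory of the corresponding image, which is one reason retained trajectories matter for training.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: log-probability of the preferred / dispreferred sample
        under the model being fine-tuned (assumed to be summed over the
        denoising trajectory in the diffusion setting).
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    beta: strength of the implicit KL regularization (hypothetical default).
    """
    # Implicit reward margin between the preferred and dispreferred sample.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss falls as the fine-tuned model raises the likelihood of the preferred sample relative to the reference model, and rises when it favors the dispreferred one. When the two images in a pair differ only in alignment-relevant regions, as visually consistent samples are meant to, this margin reflects those regions rather than unrelated visual differences.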