CV | Apr 25, 2025

Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

Kun Li, Jianhui Wang, Yangfan He, Xinyuan Song, Ruoyu Wang, Hongyang He, Wenxin Zhang, Jiaqi Chen, Keqin Li, Sida Li, Miao Zhang, Tianyu Shi

arXiv:2504.18204v13.6 | h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the challenge of meeting user preferences in text-to-image generation, though it appears incremental because it builds on existing reward-feedback methods.

The paper tackles the problem of aligning generated images with fine-grained user preferences in multi-round interactions by introducing a Visual Co-Adaptation framework that incorporates human feedback and multiple reward functions. Experiments show it outperforms state-of-the-art baselines, significantly improving image consistency and user satisfaction in multi-turn dialogues.

Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
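
The abstract mentions combining several reward signals (diversity, consistency, and preference feedback) but gives no formulas. As a rough illustration only, the sketch below shows one way such signals could be merged into a single scalar. The cosine-based definitions, the function names, and the weights are all assumptions made for this example and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def consistency_reward(prev_emb, curr_emb):
    # Assumption: consistency is similarity between image embeddings
    # from consecutive dialogue rounds.
    return cosine(prev_emb, curr_emb)

def diversity_reward(embs):
    # Assumption: diversity is the mean pairwise cosine distance
    # across a batch of image embeddings.
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

def combined_reward(preference, consistency, diversity, weights=(1.0, 0.5, 0.1)):
    # Hypothetical weights; a weighted sum is one common choice, not
    # necessarily the combination used by the authors.
    w_p, w_c, w_d = weights
    return w_p * preference + w_c * consistency + w_d * diversity
```

In a setup like this, the preference score would come from a learned reward model, and the combined scalar would drive the fine-tuning objective; both of those pieces are outside the scope of this sketch.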
