CV, Apr 25, 2025

Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

arXiv:2504.18204v1
h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of meeting user preferences in text-to-image generation, though the contribution appears incremental because it builds on existing reward-feedback methods.

The paper tackles the problem of aligning generated images with fine-grained user preferences in multi-round interactions by introducing a Visual Co-Adaptation framework that incorporates human feedback and multiple reward functions. Experiments show it outperforms state-of-the-art baselines, significantly improving image consistency and user satisfaction in multi-turn dialogues.

Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
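As a rough illustration of how several reward signals might be folded into one training signal over a multi-round dialogue, here is a minimal sketch. It is not the paper's formulation: the abstract names diversity, consistency, and preference feedback but gives no formulas, so the weighted sum, the weight values, the assumption that each score lies in [0, 1], and the names `RewardWeights`, `combined_reward`, and `dialogue_reward` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state any values.
    diversity: float = 0.2
    consistency: float = 0.3
    preference: float = 0.5


def combined_reward(scores: Dict[str, float],
                    weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of one round's reward scores (assumed to lie in [0, 1])."""
    return (weights.diversity * scores["diversity"]
            + weights.consistency * scores["consistency"]
            + weights.preference * scores["preference"])


def dialogue_reward(rounds: List[Dict[str, float]],
                    weights: RewardWeights = RewardWeights()) -> float:
    """Mean combined reward across all rounds of one dialogue."""
    if not rounds:
        raise ValueError("need at least one round")
    return sum(combined_reward(r, weights) for r in rounds) / len(rounds)
```

In a setup like the one described, a scalar of this kind could serve as the objective when updating the LoRA parameters of the diffusion model; how the authors actually combine and weight the rewards is not specified in the abstract.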
