Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
The paper tackles the problem of preserving spatial structure and fine-grained conditions, such as object poses, in controllable text-to-image diffusion models. It proposes a training-free Dual Recursive Feedback system that improves image quality and structural consistency.
The work addresses limitations of existing controllable T2I models for users who need precise spatial and appearance control, though the contribution is incremental because it builds on prior methods such as Ctrl-X and FreeControl.
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to preserve spatial structures accurately and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback, which recursively refine the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even for class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at https://github.com/jwonkm/DRF.
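To make the idea of a dual recursive update concrete, the following is a toy sketch only, not the authors' implementation. It assumes that each feedback term can be modeled as pulling a latent vector a fixed fraction of the way toward a target, and that the two pulls alternate for a fixed number of iterations. All names here (`feedback_step`, `dual_recursive_feedback`, `appearance_target`, `generation_target`, the weights, and `num_iters`) are hypothetical and do not come from the paper or its repository; the actual method operates on diffusion latents inside the sampling loop.

```python
from typing import List

Vector = List[float]


def feedback_step(latent: Vector, target: Vector, weight: float) -> Vector:
    """Move the latent a fraction `weight` of the way toward `target`."""
    return [z + weight * (t - z) for z, t in zip(latent, target)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dual_recursive_feedback(
    latent: Vector,
    appearance_target: Vector,   # hypothetical stand-in for appearance information
    generation_target: Vector,   # hypothetical stand-in for the user's intent
    appearance_weight: float = 0.3,
    generation_weight: float = 0.3,
    num_iters: int = 5,
) -> Vector:
    """Alternate two feedback updates on the same latent, recursively.

    Each iteration feeds the output of the previous one back in, so the
    latent is refined step by step toward a compromise between the targets.
    """
    refined = list(latent)  # copy so the caller's list is not modified
    for _ in range(num_iters):
        refined = feedback_step(refined, appearance_target, appearance_weight)
        refined = feedback_step(refined, generation_target, generation_weight)
    return refined


if __name__ == "__main__":
    start = [4.0, 4.0]
    appearance = [1.0, 0.0]
    generation = [0.0, 1.0]
    refined = dual_recursive_feedback(start, appearance, generation)
    print("distance to appearance target:", distance(refined, appearance))
    print("distance to generation target:", distance(refined, generation))
```

In this toy setting the refined vector ends up closer to both targets than the starting point, which mirrors, in a very simplified way, the abstract's claim that the two feedback terms jointly steer the latent. The real system would replace the simple interpolation with feedback signals computed from the diffusion model.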