Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
This work addresses the need for efficient, flexible, fine-grained control in text-to-image generation. It is incremental in that it builds on prior methods such as FreeControl and Diffusion Self-Guidance.
The paper tackles the problem of slow and inflexible controllable text-to-image generation by introducing Ctrl-X, a framework that achieves structure alignment and appearance transfer without extra training or guidance, resulting in superior image quality and faster performance compared to existing methods.
Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function and require longer diffusion steps, which makes generation time-consuming and limits their flexibility and practical use. This work presents Ctrl-X, a simple framework for T2I diffusion that controls structure and appearance without additional training or guidance. Ctrl-X uses feed-forward structure control to align the output with a structure image, and semantic-aware appearance transfer to carry over the appearance of a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x
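The abstract names two guidance-free components but does not spell out their mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that structure control amounts to replacing the output features with features computed from the structure image, and that appearance transfer re-scales normalized output features with attention-weighted mean and standard deviation of the appearance features. The function names (`inject_structure`, `transfer_appearance`) and the list-of-rows feature layout (tokens by channels) are hypothetical and chosen only for illustration. Plain Python lists are used so the sketch runs without dependencies.

```python
import math

def channel_stats(feats):
    """Per-channel mean and standard deviation over tokens (rows)."""
    n = len(feats)
    columns = list(zip(*feats))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def normalize(feats, eps=1e-6):
    """Standardize every channel to roughly zero mean and unit variance."""
    means, stds = channel_stats(feats)
    return [[(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
            for row in feats]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inject_structure(output_feats, structure_feats):
    """Assumed structure control: overwrite output features in one
    forward pass, with no gradient-based guidance step."""
    return [list(row) for row in structure_feats]

def transfer_appearance(output_feats, appearance_feats):
    """Assumed appearance transfer: each output token attends to the
    appearance tokens, then takes on their attention-weighted
    per-channel mean and standard deviation."""
    q = normalize(output_feats)
    k = normalize(appearance_feats)
    scale = math.sqrt(len(q[0]))
    result = []
    for q_row in q:
        # Semantic correspondence between this token and appearance tokens.
        weights = softmax([sum(a * b for a, b in zip(q_row, k_row)) / scale
                           for k_row in k])
        new_row = []
        for c, q_val in enumerate(q_row):
            mean = sum(w * row[c] for w, row in zip(weights, appearance_feats))
            second = sum(w * row[c] ** 2
                         for w, row in zip(weights, appearance_feats))
            std = math.sqrt(max(second - mean ** 2, 0.0))
            new_row.append(q_val * std + mean)
        result.append(new_row)
    return result
```

Both functions are single feed-forward operations on intermediate features, which is the property the abstract contrasts with guidance-based methods that optimize the latent through a score function. In a real diffusion model these operations would be applied inside selected layers at selected timesteps; which layers and timesteps is not stated in this section.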