SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback
This work addresses precise, user-aligned image editing with diffusion models, offering a method that reduces the need for extensive human annotations.
The paper tackles the challenge of aligning instruction-based image editing diffusion models with user prompts while keeping outputs consistent with the input image. It introduces SPIE, which uses online reinforcement learning with AI feedback to improve alignment and realism, and reports intricate edits in complex scenes after just 10 training steps.
This paper presents SPIE, a novel approach for semantic and structural post-training of instruction-based image editing diffusion models, addressing key challenges in alignment with user prompts and consistency with input images. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves instruction alignment and realism in two ways. First, SPIE captures fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. Second, it achieves precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. This approach reduces the effort users need to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where targeted image edits enhance the visual realism of simulated environments, which improves their utility as a proxy for real-world settings.
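To make the idea of online reinforcement learning with AI feedback concrete, the following is a minimal toy sketch, not the SPIE algorithm. It assumes a one-parameter Gaussian "policy" over a scalar edit strength in place of a diffusion model, a hypothetical `ai_feedback` scorer that combines a semantic term (following the instruction) with a structural term (leaving the rest of the image alone), and a plain REINFORCE update. All names, reward weights, and hyperparameters are illustrative assumptions; only the 10-step budget echoes the abstract.

```python
import random


def ai_feedback(edit_strength: float) -> float:
    """Hypothetical stand-in for an AI judge (an assumption, not the paper's scorer).

    Rewards edits that follow the instruction (strength near 1.0) and
    penalizes large changes to instruction-irrelevant content.
    """
    alignment = -(edit_strength - 1.0) ** 2   # semantic term
    preservation = -0.1 * edit_strength ** 2  # structural term
    return alignment + preservation


def train(steps: int = 10, samples: int = 64, lr: float = 0.3,
          sigma: float = 0.3, seed: int = 0) -> float:
    """Run a toy online RL loop and return the learned policy mean."""
    rng = random.Random(seed)
    mean = 0.0  # policy parameter: mean of a Gaussian over edit strength
    for _ in range(steps):
        # Online: candidates are sampled from the current policy at every step,
        # rather than taken from a fixed, pre-collected dataset.
        draws = [rng.gauss(mean, sigma) for _ in range(samples)]
        rewards = [ai_feedback(x) for x in draws]
        baseline = sum(rewards) / samples  # variance-reduction baseline
        # REINFORCE estimate of d E[reward] / d mean for a Gaussian policy.
        grad = sum((r - baseline) * (x - mean) / sigma ** 2
                   for x, r in zip(draws, rewards)) / samples
        mean += lr * grad
    return mean


if __name__ == "__main__":
    learned = train()
    print(f"learned edit strength: {learned:.3f}")
    print(f"feedback before: {ai_feedback(0.0):.3f}, after: {ai_feedback(learned):.3f}")
```

The point of the sketch is the loop structure: sample from the current model, score the samples with an automated judge instead of human labels, and update toward higher-scoring outputs. How SPIE actually defines its feedback signal and update rule is not specified in the abstract.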