ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
This work addresses the problem of real-image editing for users of Rectified Flow models, offering an incremental improvement over existing methods.
The paper adapts Rectified Flow (ReFlow) models to real-image editing with a training-free method that extracts features from a real image using a latent inverted only up to the mid-step, then adapts attention during feature injection. The method outperforms nine baselines on two benchmarks and is strongly preferred by users in human evaluations.
Rectified Flow (ReFlow) text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage a mid-step latent, that is, a latent inverted only up to the mid-step rather than all the way to noise. We then adapt attention during injection to improve editability and enhance alignment with the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, and human evaluations further confirm a strong user preference for our approach.
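To make the idea of a mid-step latent concrete, the following is a minimal sketch of partial inversion with an Euler solver. It is an illustration under stated assumptions, not the paper's implementation: `toy_velocity` is a hypothetical stand-in for the model's velocity prediction, the convention that t = 0 is the image and t = 1 is noise is assumed, and `mid_t` and `num_steps` are illustrative values. Feature extraction and attention adaptation are not shown.

```python
import numpy as np


def toy_velocity(x, t):
    """Hypothetical stand-in for a ReFlow model's velocity prediction v(x, t)."""
    return -x  # simple linear field, used only so the sketch runs


def euler_integrate(x, t_start, t_end, num_steps, velocity=toy_velocity):
    """Integrate dx/dt = v(x, t) from t_start to t_end with fixed Euler steps."""
    ts = np.linspace(t_start, t_end, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x


def invert_to_mid_step(image_latent, mid_t=0.5, num_steps=14):
    """Invert only from t = 0 (image) up to mid_t, stopping short of t = 1 (noise)."""
    return euler_integrate(image_latent, 0.0, mid_t, num_steps)


def reconstruct_from_mid_step(mid_latent, mid_t=0.5, num_steps=14):
    """Integrate back from mid_t to t = 0 to recover an approximate image latent."""
    return euler_integrate(mid_latent, mid_t, 0.0, num_steps)


if __name__ == "__main__":
    x0 = np.array([1.0, -2.0, 0.5])  # assumed toy "image latent"
    x_mid = invert_to_mid_step(x0)
    x_rec = reconstruct_from_mid_step(x_mid)
    print("mid-step latent:", x_mid)
    print("max reconstruction error:", np.max(np.abs(x_rec - x0)))
```

In this sketch the reverse pass reuses the same Euler discretization, so the reconstruction is approximate rather than exact. In an actual editing pipeline, the reverse pass would presumably be conditioned on the target text and combined with the injected features.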