CV AI | Aug 30, 2023

Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu

arXiv:2308.15854v31.5 | h-index: 46 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the trade-off between semantic consistency and visual quality in image editing for users of diffusion models, offering an incremental improvement over existing approaches.

The paper proposes an image-editing framework that fuses generated visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. Compared to state-of-the-art methods, it reports higher-quality images and realistic editing effects across benchmark datasets.

The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
