CV | AI | Aug 30, 2023

Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

arXiv:2308.15854v3 | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the trade-off between semantic consistency and visual quality in diffusion-based image editing, offering an incremental improvement over existing approaches.

The paper proposes a framework that fuses generated visual references with text guidance in the latent space of a frozen pre-trained diffusion model. The authors report higher-quality images and realistic editing effects across benchmark datasets compared with state-of-the-art methods.

The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.

Code Implementations: 1 repo
