CV | Mar 15, 2024

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

arXiv:2403.10133v12 citations | h-index: 15 | IEEE Transactions on Circuits and Systems for Video Technology (Print)
Originality: Incremental advance
AI Analysis

The paper addresses text alignment in image editing for users who need precise modifications. The advance is incremental because it builds on existing diffusion and CLIP-based approaches.

The paper tackles poor editability and text alignment in diffusion-based image editing. It proposes E4C, a zero-shot method that combines efficient CLIP guidance with a dual-branch pipeline, and reports improved semantic alignment while preserving source-image fidelity across various editing tasks.

Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. While current editing approaches have made improvements under text guidance, most of them have focused only on preserving the information of the input image, disregarding the importance of editability and alignment to the target prompt. In this paper, we prioritize editability by proposing a zero-shot image editing method, named Enhance Editability for text-based image Editing via Efficient CLIP guidance (E4C), which requires only inference-stage optimization to explicitly enhance editability and text alignment. Specifically, we develop a unified dual-branch feature-sharing pipeline that preserves either the structure or the texture of the source image while allowing the other to be adapted to the editing task. We further integrate CLIP guidance into our pipeline through our novel random-gateway optimization mechanism, which efficiently enhances semantic alignment with the target prompt. Comprehensive quantitative and qualitative experiments demonstrate that our method effectively resolves the text alignment issues prevalent in existing methods while maintaining fidelity to the source image, and performs well across a wide range of editing tasks.
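The idea of inference-stage optimization under CLIP-style guidance can be illustrated with a toy sketch. This is not the E4C pipeline or its random-gateway mechanism, which the abstract does not specify. Everything here is an assumption made for illustration: a fixed random matrix `W` stands in for an image encoder, the vector `t` stands in for a text embedding of the target prompt, and `guide` is a hypothetical helper that nudges a latent `z` so that the toy embedding `W z` aligns with `t` under a cosine loss, with no model weights being trained.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, z):
    return [dot(row, z) for row in W]


def cosine_loss(W, z, t):
    """Return 1 - cos(W z, t); small when the toy image embedding aligns with t."""
    u = matvec(W, z)
    return 1.0 - dot(u, t) / (math.sqrt(dot(u, u)) * math.sqrt(dot(t, t)))


def loss_grad(W, z, t):
    """Analytic gradient of cosine_loss with respect to the latent z."""
    u = matvec(W, z)
    nu, nt = math.sqrt(dot(u, u)), math.sqrt(dot(t, t))
    c = dot(u, t) / (nu * nt)
    # Derivative of the cosine similarity with respect to the embedding u.
    dc_du = [ti / (nu * nt) - c * ui / (nu * nu) for ui, ti in zip(u, t)]
    # Chain rule through u = W z; negated because the loss is 1 - cos.
    return [-sum(W[i][j] * dc_du[i] for i in range(len(W))) for j in range(len(z))]


def guide(W, z, t, steps=100, lr=0.1):
    """Hypothetical helper: gradient steps on z only (inference-stage, no training)."""
    z = list(z)
    best = cosine_loss(W, z, t)
    for _ in range(steps):
        g = loss_grad(W, z, t)
        cand = [zj - lr * gj for zj, gj in zip(z, g)]
        cand_loss = cosine_loss(W, cand, t)
        if cand_loss < best:
            z, best = cand, cand_loss  # accept only steps that improve alignment
        else:
            lr *= 0.5  # the step overshot; shrink it and try again
    return z


rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(4)]  # toy "image encoder"
z0 = [rng.gauss(0, 1) for _ in range(6)]                     # toy latent
t = [rng.gauss(0, 1) for _ in range(4)]                      # toy "text embedding"

z1 = guide(W, z0, t)
loss_before = cosine_loss(W, z0, t)
loss_after = cosine_loss(W, z1, t)
print(f"alignment loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In a real system the linear map would be replaced by a pretrained CLIP image encoder applied to a decoded image, and the gradient would come from automatic differentiation rather than the hand-derived formula used here.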
