Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing
This work addresses a key problem in image editing for users of diffusion models, offering an incremental improvement in task-specific inversion and editing.
The paper tackles the challenge of balancing reconstruction fidelity and editability when text-guided diffusion models are used to manipulate real images. It introduces TODInv, a framework that optimizes prompt embeddings in an extended space to achieve high-fidelity, precise editing, and its experiments show superior quantitative and qualitative performance over existing methods.
Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce \textbf{T}ask-\textbf{O}riented \textbf{D}iffusion \textbf{I}nversion (\textbf{TODInv}), a novel framework that inverts and edits real images in a way tailored to the specific editing task by optimizing prompt embeddings within the extended \(\mathcal{P}^*\) space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. Its hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, and optimizes only those embeddings that are unaffected by the current editing task. Extensive experiments on a benchmark dataset reveal TODInv's superior performance over existing methods, delivering both quantitative and qualitative enhancements while showcasing its versatility with few-step diffusion models.
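The task-dependent selection described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the layer split (`STRUCTURE_LAYERS`, `APPEARANCE_LAYERS`), the table `AFFECTED_LAYERS`, and the helpers `trainable_mask` and `count_trainable` are hypothetical names and values chosen only for illustration. The sketch keeps one embedding slot per (time step, layer) pair, as in the extended \(\mathcal{P}^*\) space, and marks a slot as optimizable during inversion only if its layer is assumed to be unaffected by the requested edit.

```python
from typing import Dict, FrozenSet, List

# ASSUMPTION: a toy split of six U-Net layers into two groups.
# The abstract does not say which layers control structure or appearance.
STRUCTURE_LAYERS: FrozenSet[int] = frozenset({0, 1, 2})
APPEARANCE_LAYERS: FrozenSet[int] = frozenset({3, 4, 5})

# ASSUMPTION: which layer groups each task category touches.
AFFECTED_LAYERS: Dict[str, FrozenSet[int]] = {
    "structure": STRUCTURE_LAYERS,
    "appearance": APPEARANCE_LAYERS,
    "global": STRUCTURE_LAYERS | APPEARANCE_LAYERS,
}


def trainable_mask(task: str, num_steps: int, num_layers: int) -> List[List[bool]]:
    """Return mask[t][l] == True when the embedding at time step t and
    layer l may be optimized during inversion for the given task."""
    if task not in AFFECTED_LAYERS:
        raise ValueError(f"unknown task category: {task!r}")
    affected = AFFECTED_LAYERS[task]
    # Only layers the edit does not touch are left free to optimize.
    row = [layer not in affected for layer in range(num_layers)]
    # One independent copy of the row per time step.
    return [list(row) for _ in range(num_steps)]


def count_trainable(mask: List[List[bool]]) -> int:
    """Count the (time step, layer) slots that are free to optimize."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for task in ("structure", "appearance", "global"):
        mask = trainable_mask(task, num_steps=4, num_layers=6)
        print(task, count_trainable(mask))
```

Under this toy mapping a global edit leaves no layer free to optimize, which shows that a layer-only rule is too coarse for that category. The abstract does not specify how global edits are handled, so the sketch keeps the time-step dimension in the mask without assigning it any rule.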