CV · Nov 5, 2024

DiT4Edit: Diffusion Transformer for Image Editing

arXiv:2411.03286v2 · 102 citations · h-index: 12 · AAAI
Originality: Incremental advance
AI Analysis

This work addresses image editing challenges for high-resolution and arbitrary-size images, representing an incremental improvement by adapting Diffusion Transformers to editing tasks.

The paper tackles shape-aware object editing in high-resolution images by proposing DiT4Edit, a Diffusion Transformer-based framework. According to the authors, it needs fewer inversion steps, generates higher-quality edited images faster, and outperforms UNet-based methods.

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.
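To make the inversion step mentioned in the abstract concrete, here is a minimal sketch of deterministic DDIM inversion, the baseline the abstract contrasts against. It is not the paper's DPM-Solver inversion and not the authors' code. The function names, the linear noise schedule `abar`, and the constant noise predictor are all hypothetical stand-ins chosen so the example runs without a trained model.

```python
import numpy as np

def ddim_step(x, eps, abar_from, abar_to):
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    # Clean latent implied by the current latent and the noise estimate.
    x0_pred = (x - np.sqrt(1.0 - abar_from) * eps) / np.sqrt(abar_from)
    # Re-noise that prediction to the target level with the same estimate.
    return np.sqrt(abar_to) * x0_pred + np.sqrt(1.0 - abar_to) * eps

def ddim_invert(x0, eps_fn, abar):
    """Inversion: walk from the least noisy level abar[0] to abar[-1]."""
    x = x0
    for i in range(len(abar) - 1):
        x = ddim_step(x, eps_fn(x, i), abar[i], abar[i + 1])
    return x

def ddim_sample(x_noisy, eps_fn, abar):
    """Sampling: walk back from abar[-1] to abar[0]."""
    x = x_noisy
    for i in range(len(abar) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, i), abar[i], abar[i - 1])
    return x

# Hypothetical schedule of cumulative alpha products (20 steps), illustrative only.
abar = np.linspace(0.999, 0.05, 21)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))       # stand-in for an image latent
fixed_noise = rng.standard_normal(latent.shape)

def const_eps(x, i):
    # Toy predictor (assumption): ignores its inputs and returns fixed noise.
    return fixed_noise

noisy = ddim_invert(latent, const_eps, abar)
recon = ddim_sample(noisy, const_eps, abar)
roundtrip_error = float(np.abs(recon - latent).max())
```

With this toy predictor the round trip recovers the latent up to floating-point error, because each sampling step exactly undoes the matching inversion step. With a learned predictor the noise estimate depends on the current latent, so inversion and sampling see slightly different estimates at each level and the reconstruction is only approximate.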

