CV · Nov 5, 2024

DiT4Edit: Diffusion Transformer for Image Editing

Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang

arXiv:2411.03286v2 · h-index: 12 · AAAI

Originality: Incremental advance

AI Analysis

This work addresses image editing challenges for high-resolution and arbitrary-size images, representing an incremental improvement by adapting Diffusion Transformers to editing tasks.

The paper tackles shape-aware object editing in high-resolution images by proposing DiT4Edit, a Diffusion Transformer-based framework that requires fewer inversion steps and generates higher-quality edited images faster than UNet-based methods.

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) better capture long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, which requires fewer steps than the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patch merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially for high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.
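Inversion, the step the abstract contrasts between DPM-Solver and DDIM, can be illustrated with a deterministic DDIM loop. This is a minimal sketch, not DiT4Edit's code: it assumes a linear beta schedule, uniformly spaced timesteps, and a hypothetical `toy_eps_model` standing in for the DiT noise predictor. DPM-Solver's first-order update coincides with this DDIM step; the higher-order DPM-Solver updates the paper relies on are not reproduced here. The sketch shows why inversion is approximate: each inversion step predicts noise at the current, less noisy latent, while sampling predicts it at the noisier one, so reconstruction is exact only when the prediction does not depend on the latent.

```python
import numpy as np


def linear_alphas_cumprod(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha-bar for a linear beta schedule (an assumption, not DiT-specific)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def ddim_step(x, eps, a_from, a_to):
    """Deterministic (eta = 0) DDIM move from noise level a_from to a_to.

    With a_to < a_from this adds noise (inversion); with a_to > a_from it
    denoises (sampling). It is also DPM-Solver's first-order update.
    """
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def invert(x0, model, alphas, num_steps):
    """Map a clean latent to a noisy latent in `num_steps` deterministic steps."""
    ts = np.linspace(0, len(alphas) - 1, num_steps + 1).astype(int)
    x = x0
    for i in range(num_steps):
        eps = model(x, ts[i])  # noise predicted at the current (less noisy) level
        x = ddim_step(x, eps, alphas[ts[i]], alphas[ts[i + 1]])
    return x, ts


def sample(xT, model, alphas, ts):
    """Walk the same timestep grid backwards to reconstruct the latent."""
    x = xT
    for i in reversed(range(len(ts) - 1)):
        eps = model(x, ts[i + 1])  # noise predicted at the noisier level
        x = ddim_step(x, eps, alphas[ts[i + 1]], alphas[ts[i]])
    return x


def toy_eps_model(x, t):
    """Hypothetical stand-in for a DiT noise predictor (depends on x and t)."""
    return 0.5 * np.tanh(x) * (t / 1000.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alphas = linear_alphas_cumprod()
    x0 = rng.standard_normal((4, 8, 8))  # fake latent: channels x height x width
    for steps in (10, 25, 50):
        xT, ts = invert(x0, toy_eps_model, alphas, steps)
        rec = sample(xT, toy_eps_model, alphas, ts)
        print(steps, float(np.abs(rec - x0).max()))
```

The script prints the maximum reconstruction error for several step counts. With a real noise predictor, a more accurate (higher-order) solver reduces the per-step discretization error, which is what makes inversion with fewer steps viable.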
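The abstract does not specify DiT4Edit's patch-merging rule. A widely used transformer technique for this is token merging (ToMe) with bipartite soft matching, which averages the most similar tokens to shorten the sequence that attention runs over. The sketch below is an assumed, simplified ToMe-style version: it splits patch tokens by alternating position, measures cosine similarity on the tokens themselves (ToMe uses attention keys), merges the `r` tokens from the first set that have the strongest best match into their partners, and returns an index map so per-token outputs can be restored to the original length. The names `bipartite_merge` and `unmerge` are hypothetical.

```python
import numpy as np


def bipartite_merge(tokens, r):
    """Simplified ToMe-style bipartite soft matching (hypothetical helper).

    tokens: (N, D) patch tokens. Returns (N - r, D) merged tokens and an index
    map `src` giving, for each original token, its row in the merged output.
    """
    n = tokens.shape[0]
    a_idx = np.arange(0, n, 2)  # set A: even positions
    b_idx = np.arange(1, n, 2)  # set B: odd positions
    r = min(r, len(a_idx))
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed[a_idx] @ normed[b_idx].T  # cosine similarity, shape (|A|, |B|)
    best_b = sim.argmax(axis=1)            # best partner in B for each A token
    best_score = sim.max(axis=1)
    order = np.argsort(-best_score)        # A tokens with the strongest match first
    merged_a, kept_a = order[:r], order[r:]

    # Average each merged A token into its B partner. Several A tokens may share
    # one partner, so accumulate explicitly instead of using fancy-index +=.
    b_sum = tokens[b_idx].copy()
    b_count = np.ones(len(b_idx))
    for a in merged_a:
        b_sum[best_b[a]] += tokens[a_idx[a]]
        b_count[best_b[a]] += 1
    b_out = b_sum / b_count[:, None]

    out = np.concatenate([tokens[a_idx[kept_a]], b_out], axis=0)

    # Map every original token to the output row that now represents it.
    src = np.empty(n, dtype=int)
    src[a_idx[kept_a]] = np.arange(len(kept_a))
    src[b_idx] = len(kept_a) + np.arange(len(b_idx))
    src[a_idx[merged_a]] = len(kept_a) + best_b[merged_a]
    return out, src


def unmerge(merged, src):
    """Copy merged tokens back out to the original token count."""
    return merged[src]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.standard_normal((16, 8))
    merged, src = bipartite_merge(patches, r=4)
    print(patches.shape, merged.shape, unmerge(merged, src).shape)
    # (16, 8) (12, 8) (16, 8)
```

Merging before attention and unmerging afterwards keeps the block's output shape unchanged while attention operates on fewer tokens; the explicit loop matters because NumPy's `a[idx] += b` applies a repeated index only once.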
