CV · LG · Feb 13, 2025

E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

arXiv:2502.09164v12 citations · h-index: 12
Originality: Highly original
AI Analysis

This work addresses efficient image customization in computer vision, and is relevant to researchers and developers working on image generation and editing.

The authors tackle efficient zero-shot object image customization. Compared to a Unet-based latent diffusion model, their Transformer-based model has about 1/4 of the parameters, runs inference 2.5x faster, and uses 2/3 of the GPU memory. On the VITON-HD dataset, E-MD3C outperforms the existing approach on PSNR, FID, SSIM, and LPIPS.
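As a quick sanity check on the headline ratios, the arithmetic below uses the two model sizes quoted in the abstract (468M and 1720M parameters). The variable names are illustrative only, and the latency figure simply restates the reported 2.5x speedup as a fraction of baseline time; nothing here is measured.

```python
# Parameter counts quoted in the abstract, in millions.
transformer_params_m = 468
unet_params_m = 1720

# Ratio of model sizes: about 0.27, which the abstract rounds to 1/4.
param_ratio = transformer_params_m / unet_params_m

# A 2.5x speedup means one inference takes 1 / 2.5 of the baseline time.
relative_latency = 1 / 2.5

print(f"parameter ratio:  {param_ratio:.3f}")
print(f"relative latency: {relative_latency:.2f}")
```

So "1/4 of the parameters" is an approximation of roughly 27%, and the reported speedup corresponds to about 40% of the baseline inference time.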

We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only $\frac{1}{4}$ of the parameters, our Transformer-based 468M model delivers $2.5\times$ faster inference and uses $\frac{2}{3}$ of the GPU memory compared to a 1720M Unet-based latent diffusion model.
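To make "masked diffusion transformers operating on latent patches" more concrete, here is a minimal sketch of the general idea: an autoencoder latent is cut into non-overlapping patches (tokens), and a random subset of tokens is hidden so the transformer processes fewer of them during training. This is not the authors' implementation. The latent shape (4 x 32 x 32), the patch size (2), the mask ratio (0.3), and the helper names `patchify` and `random_keep_indices` are all assumptions chosen for illustration; the abstract does not specify them.

```python
import random

def patchify(latent, patch):
    """Split a latent given as nested lists [C][H][W] into patch tokens.

    Each token is a flat list of patch * patch * C values.
    """
    c, h, w = len(latent), len(latent[0]), len(latent[0][0])
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                latent[ch][top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
                for ch in range(c)
            ])
    return tokens

def random_keep_indices(num_tokens, mask_ratio, rng):
    """Pick which tokens stay visible; the remaining ones are masked out."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))

# Hypothetical latent: 4 channels, 32 x 32 spatial grid, with easy-to-trace values.
latent = [[[float(ch * 10000 + y * 100 + x) for x in range(32)]
           for y in range(32)] for ch in range(4)]

tokens = patchify(latent, patch=2)            # 16 x 16 = 256 tokens of length 16
rng = random.Random(0)                        # fixed seed for reproducibility
keep = random_keep_indices(len(tokens), mask_ratio=0.3, rng=rng)
visible = [tokens[i] for i in keep]           # only these tokens would be processed

print(len(tokens), len(tokens[0]), len(visible))
```

Because self-attention cost grows with the number of tokens, dropping a fraction of them is one common way masked transformer designs reduce training compute. How E-MD3C chooses and reconstructs masked tokens is described in the paper itself, not in this sketch.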

