CV | Jan 7, 2025

MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer

Junsheng Luan, Guangyuan Li, Lei Zhao, Wei Xing

arXiv:2501.03630v2 | 13.16 citations | h-index: 10

Originality: Incremental advance

AI Analysis

This work targets efficiency and simplicity in virtual try-on for fashion and e-commerce applications. It is an incremental advance, since it builds on existing diffusion transformer frameworks.

The paper tackles the high complexity and computational cost of diffusion-based virtual try-on by introducing MC-VTON, which uses a diffusion transformer to integrate minimal conditional inputs. The authors report superior detail fidelity, with a distilled variant that needs only 8 inference steps and adds only 86.8M parameters.

Virtual try-on methods based on diffusion models achieve realistic try-on effects. However, they use an extra reference network or an additional image encoder to process multiple conditional image inputs, which adds pre-processing complexity and computational cost. They also require more than 25 inference steps, which lengthens inference time. In this work, building on the development of the diffusion transformer (DiT), we rethink the necessity of an additional reference network or image encoder and introduce MC-VTON, which leverages DiT's intrinsic backbone to seamlessly integrate minimal conditional try-on inputs. Compared to existing methods, the advantages of MC-VTON are demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder, along with unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map. We require only the masked person image and the garment image. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply diffusion distillation to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, trainable parameters, and inference steps than baseline methods.
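The parameter and step figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not from the paper: the variable and function names are illustrative, and the 25-step baseline is taken as a lower bound because the abstract says only "more than 25". Dividing each parameter count by its stated share gives the backbone size that the abstract implies.

```python
# Figures quoted in the abstract (parameter counts in millions).
FINE_TUNE_PARAMS_M = 39.7    # added parameters for try-on fine-tuning
DISTILLED_PARAMS_M = 86.8    # added parameters for the 8-step variant
FINE_TUNE_SHARE = 0.33 / 100   # stated share of backbone parameters
DISTILLED_SHARE = 0.72 / 100   # stated share of backbone parameters

# Assumption: "more than 25" steps is treated as exactly 25 (a lower bound).
BASELINE_STEPS = 25
MC_VTON_STEPS = 8


def implied_backbone_billions(extra_params_m: float, share: float) -> float:
    """Backbone size (in billions) implied by an added-parameter count and its share."""
    return extra_params_m / share / 1000.0


def step_reduction(baseline_steps: int, new_steps: int) -> float:
    """Ratio of baseline denoising steps to the reduced step count."""
    return baseline_steps / new_steps


backbone_from_fine_tune = implied_backbone_billions(FINE_TUNE_PARAMS_M, FINE_TUNE_SHARE)
backbone_from_distilled = implied_backbone_billions(DISTILLED_PARAMS_M, DISTILLED_SHARE)
min_step_ratio = step_reduction(BASELINE_STEPS, MC_VTON_STEPS)

print(f"Backbone implied by fine-tuning figures: {backbone_from_fine_tune:.2f}B")
print(f"Backbone implied by distillation figures: {backbone_from_distilled:.2f}B")
print(f"Denoising steps reduced by at least {min_step_ratio:.3f}x")
```

Both estimates land near 12B parameters, so the two percentages are mutually consistent and match the roughly 12B size publicly reported for FLUX.1-dev. The step ratio counts denoising steps only; it is not a wall-clock speedup, since per-step cost can differ between models.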

