ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
This work addresses computational inefficiency in virtual try-on for fashion and e-commerce applications and represents an incremental improvement.
The paper proposes ITVTON, an efficient diffusion transformer framework for virtual try-on that reduces computational overhead while improving image fidelity. It reports better results than baseline methods and is additionally evaluated on 10,257 image pairs from IGPair.
Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
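The two efficiency ideas in the abstract, width-wise concatenation of the garment and person images and training only the attention parameters of Single-DiT blocks, can be illustrated with a minimal sketch. This is not the authors' implementation: images are reduced to single-channel nested lists, the garment-then-person order is an assumption, and the helper names (`concat_along_width`, `select_trainable`) and the parameter-name substrings `single_blocks` and `attn` are hypothetical conventions chosen for illustration.

```python
from typing import Dict, List

# One image = H rows of W pixel values (single channel, for brevity).
Image = List[List[float]]


def concat_along_width(garment: Image, person: Image) -> Image:
    """Place two images of equal height side by side (garment first; order assumed)."""
    if len(garment) != len(person):
        raise ValueError("images must share the same height")
    return [g_row + p_row for g_row, p_row in zip(garment, person)]


def select_trainable(param_names: List[str]) -> Dict[str, bool]:
    """Flag only attention parameters inside single-stream blocks as trainable.

    The 'single_blocks' and 'attn' substrings are hypothetical naming
    conventions, not identifiers taken from the paper.
    """
    return {
        name: ("single_blocks" in name and "attn" in name)
        for name in param_names
    }


garment = [[1.0, 1.0], [1.0, 1.0]]            # height 2, width 2
person = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # height 2, width 3
combined = concat_along_width(garment, person)  # height 2, width 5

names = [
    "double_blocks.0.attn.qkv.weight",
    "single_blocks.0.attn.qkv.weight",
    "single_blocks.0.mlp.fc1.weight",
]
trainable = select_trainable(names)
```

In a real training setup the boolean flags would typically decide which parameters receive gradient updates, with everything else kept frozen. That is how restricting training to a small subset of weights lowers cost.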