LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover
For virtual try-on applications, this work offers a practical way around the structure-texture dilemma: garments stay aligned with the wearer's body and pose while keeping their fine texture detail.
Virtual Try-On faces a trade-off between structural integrity and textural fidelity. LPH-VTON resolves this by decomposing generation into a structure-biased early stage and a texture-biased later stage, achieving superior perceptual faithfulness and competitive structural alignment on VITON-HD.
Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases in prevailing architectures: models that rely heavily on spatial constraints favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors render vivid detail but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a framework that resolves this tension within a single, continuous denoising process. LPH-VTON decomposes generation in time: a structure-biased model first establishes a geometrically consistent latent scaffold in the early denoising stages, and control is then handed over to a texture-biased model for high-fidelity detail rendering. In extensive experiments on the standard VITON-HD dataset, our model achieves a Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment, which supports the efficacy of temporal architectural decoupling.
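
The handover within a single denoising process can be illustrated with a minimal sketch. This is not the authors' implementation: the names `handover_denoise`, `structure_step`, `texture_step`, and `handover_fraction` are hypothetical, the latent is a plain list of floats standing in for a tensor, and a hard switch at a fixed fraction of the steps is an assumption, since the abstract does not say how the handover point is chosen.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature for one denoising update: (latent, timestep) -> latent.
DenoiseStep = Callable[[List[float], int], List[float]]


def handover_denoise(
    latent: List[float],
    timesteps: Sequence[int],
    structure_step: DenoiseStep,
    texture_step: DenoiseStep,
    handover_fraction: float = 0.5,  # assumed hyperparameter, not from the paper
) -> Tuple[List[float], List[str]]:
    """Run one denoising trajectory, switching models partway through.

    The first `handover_fraction` of the steps use the structure-biased
    update; the remaining steps use the texture-biased update. The same
    latent is carried across the switch, so the process stays continuous.
    """
    if not 0.0 <= handover_fraction <= 1.0:
        raise ValueError("handover_fraction must lie in [0, 1]")

    n_structure = int(round(len(timesteps) * handover_fraction))
    schedule: List[str] = []  # records which model handled each step

    for i, t in enumerate(timesteps):
        if i < n_structure:
            latent = structure_step(latent, t)
            schedule.append("structure")
        else:
            latent = texture_step(latent, t)
            schedule.append("texture")
    return latent, schedule


# Toy stand-ins for the two models, for illustration only.
def toy_structure_step(z: List[float], t: int) -> List[float]:
    return [v * 0.5 for v in z]


def toy_texture_step(z: List[float], t: int) -> List[float]:
    return [v + 1.0 for v in z]


if __name__ == "__main__":
    final, used = handover_denoise(
        [8.0], [3, 2, 1, 0], toy_structure_step, toy_texture_step
    )
    print(final, used)
```

In a real diffusion pipeline the two step functions would wrap the respective denoising networks and a shared noise scheduler. The point of the sketch is only that both models operate on one latent trajectory rather than producing separate images that are blended afterwards.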