Xingzi Xu

CV
h-index27
8papers
22citations
Novelty50%
AI Score43

8 Papers

LGJul 17, 2024
Base Models for Parabolic Partial Differential Equations

Xingzi Xu, Ali Hasan, Jie Ding et al.

Parabolic partial differential equations (PDEs) appear in many disciplines to model the evolution of various mathematical objects, such as probability flows, value functions in control theory, and derivative prices in finance. It is often necessary to compute the solutions or a function of the solutions to a parametric PDE in multiple scenarios corresponding to different parameters of this PDE. This process often requires resolving the PDEs from scratch, which is time-consuming. To better employ existing simulations for the PDEs, we propose a framework for finding solutions to parabolic PDEs across different scenarios by meta-learning an underlying base distribution. We build upon this base distribution to propose a method for computing solutions to parametric PDEs under different parameter settings. Finally, we illustrate the application of the proposed methods through extensive experiments in generative modeling, stochastic control, and finance. The empirical results suggest that the proposed approach improves generalization to solving PDEs under new parameter regimes.

CVOct 3, 2025
DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

Qi Li, Shuwen Qiu, Julien Han et al.

The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation, adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities such as pose preservation, localized editing, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on VITON-HD, achieving superior detail preservation and robustness without reliance on additional condition encoders. It also outperforms models with VTA and image editing capabilities on a diverse dataset spanning thousands of product categories.

CVSep 24, 2025
InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Julien Han, Shuwen Qiu, Qi Li et al.

We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on formulates the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields desirable result is difficult, requires background knowledge, might be model dependent, and in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with "sleeves rolled up" styling on a person wearing long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.

CVSep 16, 2025
DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform

Xingzi Xu, Qi Li, Shuwen Qiu et al.

Diffusion models enable high-quality virtual try-on (VTO) with their established image synthesis abilities. Despite the extensive end-to-end training of large pre-trained models involved in current VTO methods, real-world applications often prioritize limited training and inference, serving, and deployment budgets for VTO. To solve this obstacle, we apply Doob's h-transform efficient fine-tuning (DEFT) for adapting large pre-trained unconditional models for downstream image-conditioned VTO abilities. DEFT freezes the pre-trained model's parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network allows training only 1.42 percent of the frozen parameters, compared to a baseline of 5.52 percent in traditional parameter-efficient fine-tuning (PEFT). To further improve DEFT's performance and decrease existing models' inference time, we additionally propose an adaptive consistency loss. Consistency training distills slow but high-performing diffusion models into a fast one while retaining performance by enforcing consistencies along the inference path. Inspired by constrained optimization, instead of distillation, we combine the consistency loss and the denoising score matching loss in a data-adaptive manner for fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, with as few as 15 denoising steps, while maintaining competitive results.

CVSep 24, 2025
Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On

Qi Li, Shuwen Qiu, Julien Han et al.

As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user's body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance using pose maps and skeletons, without adding any additional parameters or module to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding box masks, allowing the model to support flexible product integration across varied poses and conditions.

LGFeb 22, 2025
C2-DPO: Constrained Controlled Direct Preference Optimization

Kavosh Asadi, Julien Han, Idan Pipano et al.

Direct preference optimization (\texttt{DPO}) has emerged as a promising approach for solving the alignment problem in AI. In this paper, we make two counter-intuitive observations about \texttt{DPO}. First, we show that \texttt{DPO} loss could be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses, unlike the original RLHF problem where guardrails are defined on the entire distribution. Second, we prove a surprising property of this alternative optimization problem, namely that under its optimal policy, both preferred and rejected responses tend to decrease in probability, a phenomenon typically displayed by DPO in practice. To control this behavior, we propose a set of constraints designed to limit the displacement of probability mass between the preferred and rejected responses in the reference and target policies. The resulting algorithm, which we call Constrained Controlled DPO (\texttt{C2-DPO}), has a meaningful RLHF interpretation. By hedging against the displacement, \texttt{C2-DPO} provides practical improvements over vanilla \texttt{DPO} when aligning several language models using standard preference datasets.

LGJan 1, 2025
Adjoint sharding for very long context training of state space models

Xingzi Xu, Amir Tavanaei, Kavosh Asadi et al.

Despite very fast progress, efficiently training large language models (LLMs) in very long contexts remains challenging. Existing methods fall back to training LLMs with short contexts (a maximum of a few thousands tokens in training) and use inference time techniques when evaluating on long contexts (above 1M tokens context window at inference). As opposed to long-context-inference, training on very long context input prompts is quickly limited by GPU memory availability and by the prohibitively long training times it requires on state-of-the-art hardware. Meanwhile, many real-life applications require not only inference but also training/fine-tuning with long context on specific tasks. Such applications include, for example, augmenting the context with various sources of raw reference information for fact extraction, fact summarization, or fact reconciliation tasks. We propose adjoint sharding, a novel technique that comprises sharding gradient calculation during training to reduce memory requirements by orders of magnitude, making training on very long context computationally tractable. Adjoint sharding is based on the adjoint method and computes equivalent gradients to backpropagation. We also propose truncated adjoint sharding to speed up the algorithm while maintaining performance. We provide a distributed version, and a paralleled version of adjoint sharding to further speed up training. Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B parameter large language model on 1M context length training. This allows to increase the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.

LGNov 25, 2021
Characteristic Neural Ordinary Differential Equations

Xingzi Xu, Ali Hasan, Khalil Elkhalil et al.

We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of a latent variables as the solution to an ODE, C-NODE models the evolution of the latent variables as the solution of a family of first-order quasi-linear partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This in turn allows the application of the standard frameworks for solving ODEs, namely the adjoint method. Learning optimal characteristic curves for given tasks improves the performance and computational efficiency, compared to state of the art NODE models. We prove that the C-NODE framework extends the classical NODE on classification tasks by demonstrating explicit C-NODE representable functions not expressible by NODEs. Additionally, we present C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for classification and density estimation on CIFAR-10, SVHN, and MNIST datasets under a similar computational budget as the existing NODE methods. The results also provide empirical evidence that the learned curves improve the efficiency of the system through a lower number of parameters and function evaluations compared with baselines.