Jiayang Xu

CV
h-index: 19
11 papers
243 citations
Novelty: 53%
AI Score: 52

11 Papers

FLU-DYN · Feb 5, 2019
Reduced Order Modeling Framework for Combustor Instabilities Using Truncated Domain Training

Jiayang Xu, Cheng Huang, Karthik Duraisamy

A multi-fidelity framework is established and demonstrated for prediction of combustion instabilities in rocket engines. The main idea is to use a modeling approach of appropriate fidelity for each component of a rocket engine to ensure accurate and efficient predictions. Specifically, the proposed framework integrates projection-based Reduced Order Models (ROMs) that are developed using bases generated on truncated domain simulations. Because the ROM training is performed on truncated domains, it does not require full order model solutions on the full rocket geometry, which demonstrates the potential to greatly reduce training cost. Geometry-specific training is replaced by the response generated by perturbing the characteristics at the boundary of the truncated domain. This training method is shown to enhance predictive capabilities and robustness of the resulting ROMs, including at conditions outside the training range. Numerical tests are conducted on a quasi-1D model of a single-element rocket combustor and the present framework is compared to traditional ROM development approaches.
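As a rough illustration of what "projection-based ROM" means in general, the sketch below builds a POD basis from snapshots with an SVD and Galerkin-projects a linear operator onto it. It is a generic toy example, not the paper's combustor model or its truncated-domain training: the operator `A`, the initial condition, the time step, and the helper names `pod_basis` and `simulate` are all assumptions made for illustration. The initial condition is deliberately built from three eigenvectors of `A`, so a rank-3 basis reproduces the full model up to round-off.

```python
import numpy as np

def pod_basis(snapshots, rank):
    """Return the leading `rank` left singular vectors of an
    (n_dof, n_snapshots) snapshot matrix, plus all singular values."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :rank], s

def simulate(op, x0, dt, steps):
    """Explicit Euler integration of dx/dt = op @ x; columns are time steps."""
    x = x0.copy()
    history = [x.copy()]
    for _ in range(steps):
        x = x + dt * (op @ x)
        history.append(x.copy())
    return np.stack(history, axis=1)

# Hypothetical full-order model: tridiagonal diffusion-like operator.
n = 50
A = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)

# Initial condition made of the first three eigenvectors of A (an assumption
# chosen so that the dynamics stay in a three-dimensional subspace).
j = np.arange(1, n + 1)
x0 = sum(c * np.sin(k * np.pi * j / (n + 1))
         for k, c in [(1, 1.0), (2, 0.5), (3, 0.25)])

dt, steps = 0.1, 200
X = simulate(A, x0, dt, steps)        # full-order snapshots, shape (50, 201)

V, s = pod_basis(X, rank=3)           # orthonormal basis, shape (50, 3)
A_r = V.T @ A @ V                     # Galerkin-projected operator, shape (3, 3)

Q = simulate(A_r, V.T @ x0, dt, steps)  # reduced-order trajectory
rel_err = np.linalg.norm(V @ Q - X) / np.linalg.norm(X)
print(f"relative ROM error: {rel_err:.2e}")
```

In a realistic setting the snapshots would not lie exactly in a low-dimensional subspace, and the decay of the singular values `s` would guide the choice of rank.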

CL · Aug 28, 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services

Jialin Wu, Jiangyi Deng, Shengyuan Pang et al.

Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.

CV · Jan 9
Orient Anything V2: Unifying Orientation and Rotation Understanding

Zehan Wang, Ziang Zhang, Jiayang Xu et al.

This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

CV · May 30, 2025
GenSpace: Benchmarking Spatially-Aware Image Generation

Zehan Wang, Jiayang Xu, Ziang Zhang et al.

Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.

CV · Apr 9
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Jiayang Xu, Fan Zhuo, Majun Zhang et al.

Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.

CV · Apr 3, 2025
OmniCam: Unified Multimodal Video Generation via Camera Control

Xiaoda Yang, Jiayang Xu, Kaixuan Luan et al.

Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

CV · Mar 12, 2025
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment

Xiaoda Yang, JunYu Lu, Hongshun Qiu et al.

Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist's performance often compromises others' capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea's superiority over state-of-the-art models, achieving an average performance gain of +4.7%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.

NI · Oct 1, 2025
Make a Video Call with LLM: A Measurement Campaign over Five Mainstream Apps

Jiayang Xu, Xiangjie Huang, Zijie Li et al.

In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark with carefully designed metrics across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we further evaluate five mainstream AI video chatbots with this benchmark. This work provides the research community with a baseline of real-world performance and identifies unique system bottlenecks. Our benchmarking results also open up several research questions for future optimizations of AI video chatbots.

AI · Sep 22, 2025
Memory-QA: Answering Recall Questions Based on Multimodal Memories

Hongda Jiang, Xinyuan Zhang, Siddhant Garg et al. · amazon-science

We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).

LG · Sep 15, 2021
Conditionally Parameterized, Discretization-Aware Neural Networks for Mesh-Based Modeling of Physical Systems

Jiayang Xu, Aniruddhe Pradhan, Karthik Duraisamy

Simulations of complex physical systems are typically realized by discretizing partial differential equations (PDEs) on unstructured meshes. While neural networks have recently been explored for surrogate and reduced order modeling of PDE solutions, they often ignore interactions or hierarchical relations between input features, and process them as concatenated mixtures. We generalize the idea of conditional parameterization, in which trainable functions of input parameters generate the weights of a neural network, and extend it in a flexible way to encode critical information. Inspired by discretized numerical methods, choices of the parameters include physical quantities and mesh topology features. The functional relation between the modeled features and the parameters is built into the network architecture. The method is implemented on different networks and applied to frontier scientific machine learning tasks including the discovery of unmodeled physics, super-resolution of coarse fields, and the simulation of unsteady flows with chemical reactions. The results show that the conditionally-parameterized networks provide superior performance compared to their traditional counterparts. CP-GNet, an architecture that can be trained on very few data snapshots, is proposed as the first deep learning model capable of standalone prediction of reacting flows on irregular meshes.
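To make the idea of conditional parameterization concrete, here is a minimal sketch of a dense layer whose weight matrix is generated by a function of conditioning parameters `p`, rather than having `p` concatenated to the input features. The linear form `W(p) = W0 + sum_k p[k] * A[k]`, the class name `CPDense`, and all sizes are assumptions chosen for illustration; the paper's actual weight-generating functions and architectures may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

class CPDense:
    """Dense layer with weights that depend on conditioning parameters.

    Illustrative assumption: the weight matrix is an affine function of p,
    W(p) = W0 + sum_k p[k] * A[k]. In training, W0, A, and b would be the
    trainable quantities; here they are just randomly initialized.
    """

    def __init__(self, n_in, n_out, n_param):
        self.A = rng.normal(scale=0.1, size=(n_param, n_out, n_in))
        self.W0 = rng.normal(scale=0.1, size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def weights(self, p):
        # Contract p (n_param,) with A (n_param, n_out, n_in) -> (n_out, n_in).
        return self.W0 + np.tensordot(p, self.A, axes=1)

    def __call__(self, x, p):
        # The parameters act multiplicatively on the features through W(p).
        return np.tanh(self.weights(p) @ x + self.b)

layer = CPDense(n_in=3, n_out=2, n_param=2)
x = np.array([1.0, -0.5, 0.25])   # modeled features (hypothetical values)
p = np.array([0.3, 2.0])          # conditioning parameters (hypothetical values)
y = layer(x, p)
print(y.shape)
```

The contrast with a plain dense layer on the concatenated vector `[x, p]` is that here every output depends on products of entries of `p` and `x`, so the interaction between parameters and features is built into the layer instead of having to be learned from an additive mixture.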

COMP-PH · Dec 23, 2019
Multi-level Convolutional Autoencoder Networks for Parametric Prediction of Spatio-temporal Dynamics

Jiayang Xu, Karthik Duraisamy

A data-driven framework is proposed for predictive modeling of complex spatio-temporal dynamics, leveraging nested non-linear manifolds. Three levels of neural networks are used, with the goal of predicting the future state of a system of interest in a parametric setting. A convolutional autoencoder is used as the top level to encode the high dimensional input data along spatial dimensions into a sequence of latent variables. A temporal convolutional autoencoder (TCAE) serves as the second level, which further encodes the output sequence from the first level along the temporal dimension, and outputs a set of latent variables that encapsulate the spatio-temporal evolution of the dynamics. The use of dilated temporal convolutions grows the receptive field exponentially with network depth, allowing for efficient processing of long temporal sequences typical of scientific computations. A fully-connected network is used as the third level to learn the mapping between these latent variables and the global parameters from training data, and predict them for new parameters. For future state predictions, the second level uses a temporal convolutional network to predict subsequent steps of the output sequence from the top level. Latent variables at the bottom-most level are decoded to obtain the dynamics in physical space at new global parameters and/or at a future time. Predictive capabilities are evaluated on a range of problems involving discontinuities, wave propagation, strong transients, and coherent structures. The sensitivity of the results to different modeling choices is assessed. The results suggest that given adequate data and careful training, effective data-driven predictive models can be constructed. Perspectives are provided on the present approach and its place in the landscape of model reduction.
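The claim that dilated temporal convolutions grow the receptive field exponentially with depth can be checked with a short calculation. The sketch below assumes one stride-1 convolution per layer and a dilation of `dilation_base**i` at layer `i`; the function name, the kernel size, and the doubling schedule are illustrative assumptions, not the configuration used in the paper.

```python
def receptive_field(num_layers, kernel_size=2, dilation_base=2):
    """Receptive field, in time steps, of stacked dilated 1D convolutions.

    Assumes one stride-1 convolution per layer with dilation
    dilation_base**i at layer i (an illustrative schedule).
    """
    rf = 1
    for i in range(num_layers):
        dilation = dilation_base ** i
        # Each layer widens the field by (kernel_size - 1) * dilation steps.
        rf += (kernel_size - 1) * dilation
    return rf

for depth in (1, 2, 4, 8):
    dilated = receptive_field(depth)
    undilated = receptive_field(depth, dilation_base=1)
    print(f"depth={depth}: dilated={dilated}, undilated={undilated}")
```

With kernel size 2 and doubling dilations, the receptive field is `2**depth`, whereas without dilation it grows only as `depth + 1`, which is why a modest number of dilated layers can cover long temporal sequences.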