Xiang Yang

CV
h-index11
12papers
52citations
Novelty45%
AI Score54

12 Papers

58.1ROApr 13Code
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

Shihan Wu, Xuecheng Liu, Shaoxuan Xie et al.

Despite the critical role of bimanual manipulation in endowing robots with human-like dexterity, large-scale and diverse datasets remain scarce due to the significant hardware heterogeneity across bimanual robotic platforms. To bridge this gap, we introduce RoboCOIN, a large-scale multi-embodiment bimanual manipulation dataset comprising over 180,000 demonstrations collected from 15 distinct robotic platforms. Spanning 16 diverse environments-including residential, commercial, and industrial settings-the dataset features 421 bimanual tasks systematically categorized by 39 bimanual collaboration actions and 432 objects. A key innovation of our work is the hierarchical capability pyramid, which provides granular annotations ranging from trajectory-level concepts to segment-level subtasks and frame-level kinematics. Furthermore, we present CoRobot, an efficient data processing pipeline powered by the Robot Trajectory Markup Language (RTML), designed to facilitate quality assessment, automated annotation, and unified multi-embodiment and data management. Extensive experiments demonstrate the effectiveness of RoboCOIN in enhancing the performance of various bimanual manipulation models across a wide spectrum of robotic embodiments. The entire dataset and codebase are fully open-sourced, providing a valuable resource for advancing research in bimanual and multi-embodiment manipulation.

20.4CLJun 3
QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Mengao Zhang, Xiang Yang, Chang Liu et al.

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.

LGMar 22, 2023Code
Variantional autoencoder with decremental information bottleneck for disentanglement

Jiantao Wu, Shentong Mo, Xiang Yang et al.

One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous studies, which increase the information bottleneck during training, tend to lose the constraint of disentanglement, leading to the information diffusion problem. In this paper, we present a novel framework for disentangled representation learning, DeVAE, which utilizes hierarchical latent spaces with decreasing information bottlenecks across these spaces. The key innovation of our approach lies in connecting the hierarchical latent spaces through disentanglement-invariant transformations, allowing the sharing of disentanglement properties among spaces while maintaining an acceptable level of reconstruction performance. We demonstrate the effectiveness of DeVAE in achieving a balance between disentanglement and reconstruction through a series of experiments and ablation studies on dSprites and Shapes3D datasets. Code is available at https://github.com/erow/disentanglement_lib/tree/pytorch#devae.

GTJun 21, 2023
Adaptive DNN Surgery for Selfish Inference Acceleration with On-demand Edge Resource

Xiang Yang, Dezhi Chen, Qi Qi et al.

Deep Neural Networks (DNNs) have significantly improved the accuracy of intelligent applications on mobile devices. DNN surgery, which partitions DNN processing between mobile devices and multi-access edge computing (MEC) servers, can enable real-time inference despite the computational limitations of mobile devices. However, DNN surgery faces a critical challenge: determining the optimal computing resource demand from the server and the corresponding partition strategy, while considering both inference latency and MEC server usage costs. This problem is compounded by two factors: (1) the finite computing capacity of the MEC server, which is shared among multiple devices, leading to inter-dependent demands, and (2) the shift in modern DNN architecture from chains to directed acyclic graphs (DAGs), which complicates potential solutions. In this paper, we introduce a novel Decentralized DNN Surgery (DDS) framework. We formulate the partition strategy as a min-cut and propose a resource allocation game to adaptively schedule the demands of mobile devices in an MEC environment. We prove the existence of a Nash Equilibrium (NE), and develop an iterative algorithm to efficiently reach the NE for each device. Our extensive experiments demonstrate that DDS can effectively handle varying MEC scenarios, achieving up to 1.25$\times$ acceleration compared to the state-of-the-art algorithm.

39.2CVApr 2Code
SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers

Xiang Yang, Feifei Li, Mi Zhang et al.

Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.

FLU-DYNSep 13, 2023
Robust experimental data assimilation for the Spalart-Allmaras turbulence model

Deepinder Jot Singh Aulakh, Xiang Yang, Romit Maulik

This study presents a methodology focusing on the use of computational model and experimental data fusion to improve the Spalart-Allmaras (SA) closure model for Reynolds-averaged Navier-Stokes solutions. In particular, our goal is to develop a technique that not only assimilates sparse experimental data to improve turbulence model performance, but also preserves generalization for unseen cases by recovering classical SA behavior. We achieve our goals using data assimilation, namely the Ensemble Kalman filtering approach (EnKF), to calibrate the coefficients of the SA model for separated flows. A holistic calibration strategy is implemented via the parameterization of the production, diffusion, and destruction terms. This calibration relies on the assimilation of experimental data collected in the form of velocity profiles, skin friction, and pressure coefficients. Despite using observational data from a single flow condition around a backward-facing step (BFS), the recalibrated SA model demonstrates generalization to other separated flows, including cases such as the 2D NASA wall mounted hump (2D-WMH) and modified BFS. Significant improvement is observed in the quantities of interest, i.e., skin friction coefficient ($C_f$) and pressure coefficient ($C_p$) for each flow tested. Finally, it is also demonstrated that the newly proposed model recovers SA proficiency for flows, such as a NACA-0012 airfoil and axisymmetric jet (ASJ), and that the individually calibrated terms in the SA model target specific flow-physics wherein the calibrated production term improves the re-circulation zone while destruction improves the recovery zone.

12.0CVMay 18
UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

Xiang Yang, Yongli Wang, HaiFeng Li et al.

Feed-forward 3D reconstruction has recently demonstrated strong generalization across diverse scenes, yet its performance in UAV imagery remains underexplored due to distinctive acquisition geometries, large viewpoint variations, and ambiguity between horizontal field of view and flight height. We present UAVFF3D, a geometry-aware benchmark for feed-forward UAV 3D reconstruction, comprising over 170K real UAV images and more than 370K high-quality synthetic images. The benchmark also includes a challenging diagnostic test subset designed to analyze model behavior under UAV-specific geometric ambiguities.Building on UAVFF3D, we propose an evaluation protocol that jointly assesses camera-geometry estimation and reconstruction accuracy, addressing limitations of existing evaluations that rely on separate alignments. Experiments on four representative feed-forward reconstruction models show that UAV-domain adaptation substantially improves performance, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. Further analysis reveals that domain adaptation mitigates rotation-estimation degradation in oblique-view scenes and improves robustness under horizontal-field-of-view/height ambiguity. Incorporating camera priors further enhances reconstruction performance under UAV-specific acquisition geometries.

CVJan 17, 2025
3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results

Benjamin Kiefer, Lojze Žust, Jon Muhovič et al.

The 3rd Workshop on Maritime Computer Vision (MaCVi) 2025 addresses maritime computer vision for Unmanned Surface Vehicles (USV) and underwater. This report offers a comprehensive overview of the findings from the challenges. We provide both statistical and qualitative analyses, evaluating trends from over 700 submissions. All datasets, evaluation code, and the leaderboard are available to the public at https://macvi.org/workshop/macvi25.

CVJan 4
ShadowGS: Shadow-Aware 3D Gaussian Splatting for Satellite Imagery

Feng Luo, Hongbo Pan, Xiang Yang et al.

3D Gaussian Splatting (3DGS) has emerged as a novel paradigm for 3D reconstruction from satellite imagery. However, in multi-temporal satellite images, prevalent shadows exhibit significant inconsistencies due to varying illumination conditions. To address this, we propose ShadowGS, a novel framework based on 3DGS. It leverages a physics-based rendering equation from remote sensing, combined with an efficient ray marching technique, to precisely model geometrically consistent shadows while maintaining efficient rendering. Additionally, it effectively disentangles different illumination components and apparent attributes in the scene. Furthermore, we introduce a shadow consistency constraint that significantly enhances the geometric accuracy of 3D reconstruction. We also incorporate a novel shadow map prior to improve performance with sparse-view inputs. Extensive experiments demonstrate that ShadowGS outperforms current state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction precision, and novel view synthesis quality, with only a few minutes of training. ShadowGS exhibits robust performance across various settings, including RGB, pansharpened, and sparse-view satellite inputs.

QMJul 16, 2025
GLOMIA-Pro: A Generalizable Longitudinal Medical Image Analysis Framework for Disease Progression Prediction

Shuaitong Zhang, Yuchen Sun, Yong Ao et al.

Longitudinal medical images are essential for monitoring disease progression by capturing spatiotemporal changes associated with dynamic biological processes. While current methods have made progress in modeling spatiotemporal patterns, they face three key limitations: (1) lack of generalizable framework applicable to diverse disease progression prediction tasks; (2) frequent overlook of the ordinal nature inherent in disease staging; (3) susceptibility to representation collapse due to structural similarities between adjacent time points, which can obscure subtle but discriminative progression biomarkers. To address these limitations, we propose a Generalizable LOngitudinal Medical Image Analysis framework for disease Progression prediction (GLOMIA-Pro). GLOMIA-Pro consists of two core components: progression representation extraction and progression-aware fusion. The progression representation extraction module introduces a piecewise orthogonal attention mechanism and employs a novel ordinal progression constraint to disentangle finegrained temporal imaging variations relevant to disease progression. The progression-aware fusion module incorporates a redesigned skip connection architecture which integrates the learned progression representation with current imaging representation, effectively mitigating representation collapse during cross-temporal fusion. Validated on two distinct clinical applications: knee osteoarthritis severity prediction and esophageal cancer treatment response assessment, GLOMIA-Pro consistently outperforms seven state-of-the-art longitudinal analysis methods. Ablation studies further confirm the contribution of individual components, demonstrating the robustness and generalizability of GLOMIA-Pro across diverse clinical scenarios.

CVSep 26, 2017
Scale Adaptive Clustering of Multiple Structures

Xiang Yang, Peter Meer

We propose the segmentation of noisy datasets into Multiple Inlier Structures with a new Robust Estimator (MISRE). The scale of each individual structure is estimated adaptively from the input data and refined by mean shift, without tuning any parameter in the process, or manually specifying thresholds for different estimation problems. Once all the data points were classified into separate structures, these structures are sorted by their densities with the strongest inlier structures coming out first. Several 2D and 3D synthetic and real examples are presented to illustrate the efficiency, robustness and the limitations of the MISRE algorithm.

CVSep 20, 2016
Robust Estimation of Multiple Inlier Structures

Xiang Yang, Peter Meer

The robust estimator presented in this paper processes each structure independently. The scales of the structures are estimated adaptively and no threshold is involved in spite of different objective functions. The user has to specify only the number of elemental subsets for random sampling. After classifying all the input data, the segmented structures are sorted by their strengths and the strongest inlier structures come out at the top. Like any robust estimators, this algorithm also has limitations which are described in detail. Several synthetic and real examples are presented to illustrate every aspect of the algorithm.