Rui Gao

LG
h-index: 35
44 papers
769 citations
Novelty: 54%
AI Score: 58

44 Papers

FLU-DYN Dec 30, 2022
A Finite Element-Inspired Hypergraph Neural Network: Application to Fluid Dynamics Simulations

Rui Gao, Indu Kant Deo, Rajeev K. Jaiman

An emerging trend in deep learning research focuses on the applications of graph neural networks (GNNs) for mesh-based continuum mechanics simulations. Most of these learning frameworks operate on graphs wherein each edge connects two nodes. Inspired by the data connectivity in the finite element method, we present a method to construct a hypergraph by connecting the nodes by elements rather than edges. A hypergraph message-passing network is defined on such a node-element hypergraph that mimics the calculation process of local stiffness matrices. We term this method a finite element-inspired hypergraph neural network, in short FEIH($φ$)-GNN. We further equip the proposed network with rotation equivariance, and explore its capability for modeling unsteady fluid flow systems. The effectiveness of the network is demonstrated on two common benchmark problems, namely the fluid flow around a circular cylinder and airfoil configurations. Stabilized and accurate temporal roll-out predictions can be obtained using the $φ$-GNN framework within the interpolation Reynolds number range. The network is also able to extrapolate moderately toward a higher Reynolds number domain outside the training range.
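
The node-element connectivity described in this abstract can be illustrated with a minimal sketch. It assumes a mesh given as a list of elements, each a tuple of node indices; the function name and the two-triangle mesh are hypothetical, and the code only builds the incidence maps that a message-passing network would operate on, not the network itself.

```python
from collections import defaultdict


def build_node_element_hypergraph(elements):
    """Return (node_to_elements, element_to_nodes) incidence maps.

    Each mesh element acts as one hyperedge joining all of its nodes,
    instead of being split into pairwise node-node edges.
    """
    element_to_nodes = {e: tuple(nodes) for e, nodes in enumerate(elements)}
    node_to_elements = defaultdict(list)
    for e, nodes in element_to_nodes.items():
        for n in nodes:
            node_to_elements[n].append(e)
    return dict(node_to_elements), element_to_nodes


# Hypothetical mesh: two triangles sharing the edge between nodes 1 and 2.
triangles = [(0, 1, 2), (1, 3, 2)]
node_to_elements, element_to_nodes = build_node_element_hypergraph(triangles)
```

Nodes 1 and 2 belong to both hyperedges, which is where information would be exchanged between neighbouring elements.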

FLU-DYN Oct 9, 2022
Predicting fluid-structure interaction with graph neural networks

Rui Gao, Rajeev K. Jaiman

We present a rotation equivariant, quasi-monolithic graph neural network framework for the reduced-order modeling of fluid-structure interaction systems. With the aid of an arbitrary Lagrangian-Eulerian formulation, the system states are evolved temporally with two sub-networks. The movement of the mesh is reduced to the evolution of several coefficients via complex-valued proper orthogonal decomposition, and the prediction of these coefficients over time is handled by a single multi-layer perceptron. A finite element-inspired hypergraph neural network is employed to predict the evolution of the fluid state based on the state of the whole system. The structural state is implicitly modeled by the movement of the mesh on the solid-fluid interface; hence it makes the proposed framework quasi-monolithic. The effectiveness of the proposed framework is assessed on two prototypical fluid-structure systems, namely the flow around an elastically-mounted cylinder, and the flow around a hyperelastic plate attached to a fixed cylinder. The proposed framework tracks the interface description and provides stable and accurate system state predictions during roll-out for at least 2000 time steps, and even demonstrates some capability in self-correcting erroneous predictions. The proposed framework also enables direct calculation of the lift and drag forces using the predicted fluid and mesh states, in contrast to existing convolution-based architectures. The proposed reduced-order model via graph neural network has implications for the development of physics-based digital twins concerning moving boundaries and fluid-structure interactions.

LG Jan 27, 2023
Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions

Hao Wang, Luxi He, Rui Gao et al.

Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy when fairness constraints are applied and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing fairness interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination on standard (overused) tabular datasets. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.

DC Dec 15, 2025 Code
SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

Lei Qu, Lianhai Ren, Peng Cheng et al.

An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, we present SIGMA, an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), a system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has achieved 94.45% effective cluster accelerator utilization, while also substantially reducing node recycling and job-recovery times. Building on the foundation of LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators. This effort achieved 21.08% MFU and state-of-the-art downstream accuracy, and encountered only one stability incident over a 75-day period. Together, these advances establish SIGMA as a stack that not only tackles the critical challenges of large-scale training but also sets a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.

CL Dec 18, 2025 Code
Sigma-MoE-Tiny Technical Report

Qingguo Hu, Zhenghao Lin, Ziyue Yang et al.

Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny Code: https://github.com/microsoft/ltp-megatron-lm
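
To make the "one expert per token" setting concrete, the following is a generic sketch of top-1 softmax routing together with a per-expert load count. It is not the Sigma-MoE-Tiny router: the function names, the toy logits, and the use of four experts are assumptions for illustration only. The last line simply restates the abstract's parameter counts as an activation ratio.

```python
import math


def top1_route(router_logits):
    """For each token, pick the single expert with the highest softmax probability."""
    assignments = []
    for logits in router_logits:
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        probs = [v / total for v in exps]
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
    return assignments


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = [0] * num_experts
    for expert, _ in assignments:
        counts[expert] += 1
    return counts


# Hypothetical router logits: 3 tokens, 4 experts.
logits = [[2.0, 0.1, -1.0, 0.3], [0.0, 0.2, 3.0, -0.5], [1.5, 0.0, 0.1, 0.2]]
assignments = top1_route(logits)
load = expert_load(assignments, num_experts=4)

# From the abstract: 0.5B activated parameters out of 20B total.
activated_fraction = 0.5 / 20
```

With top-1 routing, any expert that never wins the argmax receives no tokens at all, which is one way to see why load balancing becomes the central difficulty at this level of sparsity.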

OC Apr 30, 2022
A Short and General Duality Proof for Wasserstein Distributionally Robust Optimization

Luhao Zhang, Jincheng Yang, Rui Gao

We present a general duality result for Wasserstein distributionally robust optimization that holds for any Kantorovich transport cost, measurable loss function, and nominal probability distribution. Assuming an interchangeability principle inherent in existing duality results, our proof only uses one-dimensional convex analysis. Furthermore, we demonstrate that the interchangeability principle holds if and only if certain measurable projection and weak measurable selection conditions are satisfied. To illustrate the broader applicability of our approach, we provide a rigorous treatment of duality results in distributionally robust Markov decision processes and distributionally robust multistage stochastic programming. Additionally, we extend our analysis to other problems such as infinity-Wasserstein distributionally robust optimization, risk-averse optimization, and globalized distributionally robust counterpart.
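
For orientation, the duality in question is commonly stated in the following form. The notation is an assumption rather than the paper's own: $\hat{P}$ is the nominal distribution, $c$ the transport cost, $\mathcal{K}_c$ the induced optimal transport cost, $\rho \ge 0$ the radius, and $f$ the loss; the precise conditions under which equality holds are those given in the paper.

$$\sup_{P:\ \mathcal{K}_c(\hat{P}, P) \le \rho} \mathbb{E}_{Z \sim P}\big[f(Z)\big] \;=\; \inf_{\lambda \ge 0} \Big\{ \lambda \rho + \mathbb{E}_{\hat{Z} \sim \hat{P}} \Big[ \sup_{z} \big\{ f(z) - \lambda\, c(\hat{Z}, z) \big\} \Big] \Big\}$$

The right-hand side is a one-dimensional problem in the multiplier $\lambda$, which is consistent with the abstract's remark that the proof needs only one-dimensional convex analysis.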

85.4 LG May 25
Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Zhaoyu Zhu, Rui Gao, Shuang Li

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz (PL) condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable PL-type geometry that supports global convergence of WPG.

FLU-DYN Nov 1, 2022
Combined space-time reduced-order model with 3D deep convolution for extrapolating fluid dynamics

Indu Kant Deo, Rui Gao, Rajeev Jaiman

There is a critical need for efficient and reliable active flow control strategies to reduce drag and noise in aerospace and marine engineering applications. While traditional full-order models based on the Navier-Stokes equations are not feasible, advanced model reduction techniques can be inefficient for active control tasks, especially with strong non-linearity and convection-dominated phenomena. Using convolutional recurrent autoencoder network architectures, deep learning-based reduced-order models have been recently shown to be effective while performing several orders of magnitude faster than full-order simulations. However, these models encounter significant challenges outside the training data, limiting their effectiveness for active control and optimization tasks. In this study, we aim to improve the extrapolation capability by modifying network architecture and integrating coupled space-time physics as an implicit bias. Reduced-order models via deep learning generally employ decoupling in spatial and temporal dimensions, which can introduce modeling and approximation errors. To alleviate these errors, we propose a novel technique for learning coupled spatial-temporal correlation using a 3D convolution network. We assess the proposed technique against a standard encoder-propagator-decoder model and demonstrate a superior extrapolation performance. To demonstrate the effectiveness of 3D convolution network, we consider a benchmark problem of the flow past a circular cylinder at laminar flow conditions and use the spatio-temporal snapshots from the full-order simulations. Our proposed 3D convolution architecture accurately captures the velocity and pressure fields for varying Reynolds numbers. Compared to the standard encoder-propagator-decoder network, the spatio-temporal-based 3D convolution network improves the prediction range of Reynolds numbers outside of the training data.

LG Jul 3, 2024
SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

Liulu He, Yufei Zhao, Rui Gao et al.

Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we propose SFC, a new algebraic transform for fast convolution that extends the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational numbers and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and an FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.
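
As background for the multiplication counts quoted here, the sketch below implements the standard one-dimensional Winograd F(2,3) transform, the baseline the abstract compares against and not SFC itself. It produces two outputs of a 3-tap filter with 4 multiplications instead of 6. Nesting it in two dimensions gives 16 multiplications instead of 36 for a 2x2 output tile of a 3x3 convolution, which is the 2.25x figure. Function names and test values are illustrative assumptions.

```python
def direct_conv_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, using 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1


def winograd_f23(d, g):
    """Same two outputs using 4 multiplications between data and filter terms.

    The filter-side factors depend only on g and can be precomputed once,
    so they are not counted as per-tile multiplications.
    """
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return m1 + m2 + m3, m2 - m3 - m4


data = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 2.0, 3.0]
direct = direct_conv_f23(data, taps)
fast = winograd_f23(data, taps)

# Multiplication reduction for a 2x2 output tile of a 3x3 convolution.
reduction_2d = (4 * 9) / 16
```

The halving in the filter transform is a simple example of the non-integer constants that make such transforms sensitive to low-precision arithmetic.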

91.7 LG Apr 24 Code
C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

Rui Gao, Youngseung Jeon, Swastik Roy et al.

Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.

LG Aug 19, 2024
Regularization for Adversarial Robust Learning

Jie Wang, Rui Gao, Yao Xie

Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computational intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $φ$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate its asymptotic equivalence to a regularized empirical risk minimization framework, by considering various scaling regimes of the regularization parameter and robustness level. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.

OC Jul 23, 2024
Data-driven Multistage Distributionally Robust Linear Optimization with Nested Distance

Rui Gao, Rohit Arora, Yizhe Huang

We study multistage distributionally robust linear optimization, where the uncertainty set is defined as a ball of distributions centered at a scenario tree using the nested distance. The resulting minimax problem is notoriously difficult to solve due to its inherent non-convexity. In this paper, we demonstrate that, under mild conditions, the robust risk evaluation of a given policy can be expressed in an equivalent recursive form. Furthermore, assuming stagewise independence, we derive equivalent dynamic programming reformulations to find an optimal robust policy that is time-consistent and well-defined on unseen sample paths. Our reformulations reconcile two modeling frameworks: the multistage-static formulation (with nested distance) and the multistage-dynamic formulation (with one-period Wasserstein distance). Moreover, we identify tractable cases when the value functions can be computed efficiently using convex optimization techniques.

CV Jan 22, 2022 Code
Phase-SLAM: Phase Based Simultaneous Localization and Mapping for Mobile Structured Light Illumination Systems

Xi Zheng, Rui Ma, Rui Gao et al.

Structured Light Illumination (SLI) systems have been used for reliable indoor dense 3D scanning via phase triangulation. However, mobile SLI systems for 360 degree 3D reconstruction demand 3D point cloud registration, involving high computational complexity. In this paper, we propose a phase based Simultaneous Localization and Mapping (Phase-SLAM) framework for fast and accurate SLI sensor pose estimation and 3D object reconstruction. The novelty of this work is threefold: (1) developing a reprojection model from 3D points to 2D phase data towards phase registration with low computational complexity; (2) developing a local optimizer to achieve SLI sensor pose estimation (odometry) using the derived Jacobian matrix for the 6 DoF variables; (3) developing a compressive phase comparison method to achieve high-efficiency loop closure detection. The whole Phase-SLAM pipeline is then exploited using existing global pose graph optimization techniques. We build datasets from both the unreal simulation platform and a robotic arm based SLI system in real-world to verify the proposed approach. The experiment results demonstrate that the proposed Phase-SLAM outperforms other state-of-the-art methods in terms of the efficiency and accuracy of pose estimation and 3D reconstruction. The open-source code is available at https://github.com/ZHENGXi-git/Phase-SLAM.

LG Mar 3
Wasserstein Proximal Policy Gradient

Zhaoyu Zhu, Shuhan Zhang, Rui Gao et al.

We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy's log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.

61.3 SY Apr 30
Distributed Observer Design for Discrete-Time LTI Systems via Jordan Canonical Form

Giulio Fattore, Maria Elena Valcher, Rui Gao et al.

This paper addresses the problem of distributed state estimation for discrete-time linear time-invariant systems. Building on the framework proposed in Gao & Yang (2025), we exploit the Jordan canonical form of the system matrix to develop two distributed estimation schemes that ensure asymptotic convergence of local estimates to the true system state. In both approaches, each node reconstructs the components of the state that are locally detectable for it via a Luenberger observer, while employing a consensus-based mechanism to estimate the components that are not directly detectable. The first scheme relies on local observers whose dimension matches that of the original state vector; however, its applicability requires the satisfaction of a large set of inequalities. The second scheme, in contrast, can be implemented under less restrictive conditions, but results in observers of increased (augmented) order. For both methods, we derive necessary and sufficient conditions - expressed in terms of the eigenvalues of the system matrix and certain submatrices of the communication network Laplacian - that guarantee the existence of a distributed observer achieving asymptotically accurate estimation. Compared to Gao & Yang (2025), the proposed approaches offer greater flexibility in the selection of coupling gains and impose less stringent solvability conditions.
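
The Luenberger observer that each node runs on its locally detectable components can be illustrated with a single-node, scalar sketch. It deliberately omits the distributed and consensus parts of the paper. For a system $x_{k+1} = a x_k$, $y_k = c x_k$, the observer update is $\hat{x}_{k+1} = a \hat{x}_k + \ell (y_k - c \hat{x}_k)$, so the estimation error evolves as $e_{k+1} = (a - \ell c) e_k$. All numerical values below are illustrative assumptions.

```python
def simulate_observer(a, c, gain, x0, xhat0, steps):
    """Simulate a scalar discrete-time system and its Luenberger observer.

    Returns the absolute estimation error after each step.
    """
    x, xhat = x0, xhat0
    errors = []
    for _ in range(steps):
        y = c * x                                 # measurement of the true state
        xhat = a * xhat + gain * (y - c * xhat)   # observer update with output injection
        x = a * x                                 # true (unstable) dynamics
        errors.append(abs(x - xhat))
    return errors


# Unstable plant (a = 1.2); the gain places the error dynamics at a - gain * c = 0.2.
errors = simulate_observer(a=1.2, c=1.0, gain=1.0, x0=1.0, xhat0=0.0, steps=10)
```

Although the state itself grows, the error shrinks by a factor of 0.2 per step because $|a - \ell c| < 1$, which is the scalar version of the detectability requirement.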

LG Jan 8
DeepHalo: A Neural Choice Model with Controllable Context Effects

Shuhan Zhang, Zhi Wang, Rui Gao et al.

Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself -- a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

IV Mar 19, 2025
FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis

Yaofei Duan, Tao Tan, Zhiyuan Zhu et al.

Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieves state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images generated by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.

RO Sep 4, 2025
Long-Horizon Visual Imitation Learning via Plan and Code Reflection

Quan Chen, Chenrui Shi, Qi Chen et al.

Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.

LG May 7, 2024
Relating-Up: Advancing Graph Neural Networks through Inter-Graph Relationships

Qi Zou, Na Yu, Daoliang Zhang et al.

Graph Neural Networks (GNNs) have excelled in learning from graph-structured data, especially in understanding the relationships within a single graph, i.e., intra-graph relationships. Despite their successes, GNNs are limited by neglecting the context of relationships across graphs, i.e., inter-graph relationships. Recognizing the potential to extend this capability, we introduce Relating-Up, a plug-and-play module that enhances GNNs by exploiting inter-graph relationships. This module incorporates a relation-aware encoder and a feedback training strategy. The former enables GNNs to capture relationships across graphs, enriching relation-aware graph representation through collective context. The latter uses a feedback loop mechanism for the recursive refinement of these representations, leveraging insights from refining inter-graph dynamics to drive the feedback loop. The synergy between these two innovations results in a robust and versatile module. Relating-Up enhances the expressiveness of GNNs, enabling them to encapsulate a wider spectrum of graph relationships with greater precision. Our evaluations across 16 benchmark datasets demonstrate that integrating Relating-Up into GNN architectures substantially improves performance, positioning Relating-Up as a formidable choice for a broad spectrum of graph representation learning tasks.

ML Mar 21, 2024
Non-Convex Robust Hypothesis Testing using Sinkhorn Uncertainty Sets

Jie Wang, Rui Gao, Yao Xie

We present a new framework to address the non-convex robust hypothesis testing problem, wherein the goal is to seek the optimal detector that minimizes the maximum of worst-case type-I and type-II risk functions. The distributional uncertainty sets are constructed to center around the empirical distribution derived from samples based on Sinkhorn discrepancy. Given that the objective involves non-convex, non-smooth probabilistic functions that are often intractable to optimize, existing methods resort to approximations rather than exact solutions. To tackle the challenge, we introduce an exact mixed-integer exponential conic reformulation of the problem, which can be solved to global optimality with a moderate amount of input data. Subsequently, we propose a convex approximation, demonstrating its superiority over current state-of-the-art methodologies in the literature. Furthermore, we establish connections between robust hypothesis testing and regularized formulations of non-robust risk functions, offering insightful interpretations. Our numerical study highlights the satisfactory testing performance and computational efficiency of the proposed framework.

CV Feb 21
MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

Haoyu Zhang, Yuwei Wu, Pengxiang Li et al.

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct ReflectV, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

CV Mar 7
Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

Xu Chen, Rui Gao, Xinjie Zhang et al.

Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.

ROMar 7
Morphology-Independent Facial Expression Imitation for Human-Face Robots

Xu Chen, Rui Gao, Che Sun et al.

Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.

CVMar 7
Fine-Grained 3D Facial Reconstruction for Micro-Expressions

Che Sun, Xinjie Zhang, Rui Gao et al.

Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.

NCJun 9, 2025
Automatic Depression Assessment using Machine Learning: A Comprehensive Survey

Siyang Song, Yupeng Huo, Shiqing Tang et al.

Depression is a common mental illness across current human society. Traditional depression assessment relying on inventories and interviews with psychologists frequently suffers from subjective diagnostic results, a slow and expensive diagnostic process, and a lack of human resources. Since there is solid evidence that depression is reflected by various human internal brain activities and external expressive behaviours, early traditional machine learning (ML) and advanced deep learning (DL) models have been widely explored for human behaviour-based automatic depression assessment (ADA) since 2012. However, recent ADA surveys typically only focus on a limited number of human behaviour modalities. Despite being used as a theoretical basis for developing ADA approaches, existing ADA surveys lack a comprehensive review and summary of multi-modal depression-related human behaviours. To bridge this gap, this paper specifically summarises depression-related human behaviours across a range of modalities (e.g. the human brain, verbal language and non-verbal audio/facial/body behaviours). We focus on conducting an up-to-date and comprehensive survey of ML-based ADA approaches for learning depression cues from these behaviours as well as discussing and comparing their distinctive features and limitations. In addition, we also review existing ADA competitions and datasets, and identify and discuss the main challenges and opportunities to provide further research directions for future ADA researchers.

FLU-DYNMar 28, 2025
Data-driven modeling of fluid flow around rotating structures with graph neural networks

Rui Gao, Zhi Cheng, Rajeev K. Jaiman

Graph neural networks, recently introduced into the field of fluid flow surrogate modeling, have been successfully applied to model the temporal evolution of various fluid flow systems. Existing applications, however, are mostly restricted to cases where the domain is time-invariant. The present work extends the application of graph neural network-based modeling to fluid flow around structures rotating with respect to a certain axis. Specifically, we propose to apply a graph neural network-based surrogate modeling for fluid flow with the mesh corotating with the structure. Unlike conventional data-driven approaches that rely on structured Cartesian meshes, our framework operates on unstructured co-rotating meshes, enforcing rotation equivariance of the learned model by leveraging co-rotating polar (2D) and cylindrical (3D) coordinate systems. To model the pressure for systems without Dirichlet pressure boundaries, we propose a novel local directed pressure difference formulation that is invariant to the reference pressure point and value. For flow systems with large mesh sizes, we introduce a scheme to train the network in single or distributed graphics processing units by accumulating the backpropagated gradients from partitions of the mesh. The effectiveness of our proposed framework is examined on two test cases: (i) fluid flow in a 2D rotating mixer, and (ii) the flow past a 3D rotating cube. Our results show that the model achieves stable and accurate rollouts for over 2000 time steps in periodic regimes while capturing accurate short-term dynamics in chaotic flow regimes. In addition, the drag and lift force predictions closely match the CFD calculations, highlighting the potential of the framework for modeling both periodic and chaotic fluid flow around rotating structures.
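The invariance claimed for the pressure formulation can be illustrated with a toy example. The sketch below is not the authors' formulation; it only assumes that pressure is represented by differences along directed mesh edges (the function name, node values, and edge list are hypothetical) and checks that such differences do not change when the reference pressure is shifted by a constant.

```python
def directed_pressure_differences(pressure, edges):
    """Return p[j] - p[i] for every directed edge (i, j)."""
    return [pressure[j] - pressure[i] for i, j in edges]


# Hypothetical nodal pressures on a four-node mesh and its directed edges.
pressure = [101.0, 99.5, 100.25, 98.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Shifting the reference pressure adds the same constant to every node.
shifted = [p + 1000.0 for p in pressure]

base = directed_pressure_differences(pressure, edges)
moved = directed_pressure_differences(shifted, edges)
```

Because the constant cancels in every difference, `base` and `moved` agree, which is the property that makes a difference-based target independent of the reference point and value.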

CVMar 14, 2024
Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure

Fan Wan, Xingyu Miao, Haoran Duan et al.

With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative SG-ZSL paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model and a generator that links both model entities. The teacher model serves as a sentinel on behalf of the data owner, replacing real data, to guide the student model at the AI service provider's end during training. Considering the disparity of knowledge space between the teacher and student, we introduce two variants of the teacher model: the omniscient and the quasi-omniscient teachers. Under these teachers' guidance, the student model seeks to match the teacher model's performance and explores domains that the teacher has not covered. To trade off between privacy and performance, we further introduce two distinct security-level training protocols: white-box and black-box, enhancing the paradigm's adaptability. Despite the inherent challenges of real data absence in the SG-ZSL paradigm, it consistently outperforms in ZSL and GZSL tasks, notably in the white-box protocol. Our comprehensive evaluation further attests to its robustness and efficiency across various setups, including the stringent black-box training protocol.

IVFeb 10, 2024
Point cloud-based registration and image fusion between cardiac SPECT MPI and CTA

Shaojie Tang, Penpen Miao, Xingyu Gao et al.

A method was proposed for the point cloud-based registration and image fusion between cardiac single photon emission computed tomography (SPECT) myocardial perfusion images (MPI) and cardiac computed tomography angiograms (CTA). Firstly, the left ventricle (LV) epicardial regions (LVERs) in SPECT and CTA images were segmented by using different U-Net neural networks trained to generate the point clouds of the LV epicardial contours (LVECs). Secondly, according to the characteristics of cardiac anatomy, the special points of anterior and posterior interventricular grooves (APIGs) were manually marked in both SPECT and CTA image volumes. Thirdly, we developed an in-house program for coarsely registering the special points of APIGs to ensure a correct cardiac orientation alignment between SPECT and CTA images. Fourthly, we employed the ICP, SICP or CPD algorithm to achieve a fine registration for the point clouds (together with the special points of APIGs) of the LV epicardial surfaces (LVERs) in SPECT and CTA images. Finally, the image fusion between SPECT and CTA was realized after the fine registration. The experimental results showed that the cardiac orientation was aligned well and the mean distance error of the optimal registration method (CPD with affine transform) was consistently less than 3 mm. The proposed method could effectively fuse the structures from cardiac CTA and SPECT functional images, and demonstrated potential for assisting accurate diagnosis of cardiac diseases by combining the complementary advantages of the two imaging modalities.

CVFeb 23, 2022
Absolute Zero-Shot Learning

Rui Gao, Fan Wan, Daniel Organisciak et al.

Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and a student network, which can achieve data-free knowledge transfer while maintaining the performance of the teacher network. We investigate 'black-box' and 'white-box' scenarios in the AZSL task as different levels of model security. In addition, we provide a discussion of the teacher model in both inductive and transductive settings. Despite embarrassingly simple implementations and data-missing disadvantages, our AZSL framework can retain state-of-the-art ZSL and GZSL performance under the 'white-box' scenario. Extensive qualitative and quantitative analysis also demonstrates promising results when deploying the model under the 'black-box' scenario.

OCSep 24, 2021
Sinkhorn Distributionally Robust Optimization

Jie Wang, Rui Gao, Yao Xie

We study distributionally robust optimization with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We derive a convex programming dual reformulation for general nominal distributions, transport costs, and loss functions. To solve the dual reformulation, we develop a stochastic mirror descent algorithm with biased subgradient estimators and derive its computational complexity guarantees. Finally, we provide numerical examples using synthetic and real data to demonstrate its superior performance.
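For readers unfamiliar with entropic regularization, the following is a minimal sketch of the classical Sinkhorn iteration for two small discrete distributions. It is not the paper's dual reformulation or its stochastic mirror descent algorithm; the support points, weights, regularization strength `eps`, and iteration count are illustrative assumptions.

```python
import math


def sinkhorn_plan(a, b, cost, eps=0.5, iters=500):
    """Entropic-regularized transport plan between weights a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-c_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 1D supports with uniform weights and absolute-distance cost.
xs = [0.0, 1.0, 2.0]
ys = [0.5, 2.5]
a = [1.0 / 3.0] * 3
b = [0.5, 0.5]
cost = [[abs(x - y) for y in ys] for x in xs]

plan = sinkhorn_plan(a, b, cost)
transport_cost = sum(plan[i][j] * cost[i][j] for i in range(3) for j in range(2))
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

The returned plan spreads mass more diffusely than an unregularized optimal plan, so `transport_cost` is at least the unregularized optimal transport cost for the same marginals; smaller `eps` brings the two closer at the price of slower, less stable iterations.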

MEMay 20, 2021
Hierarchical Non-Stationary Temporal Gaussian Processes With $L^1$-Regularization

Zheng Zhao, Rui Gao, Simo Särkkä

This paper is concerned with regularized extensions of hierarchical non-stationary temporal Gaussian processes (NSGPs) in which the parameters (e.g., length-scale) are modeled as GPs. In particular, we consider two commonly used NSGP constructions which are based on explicitly constructed non-stationary covariance functions and stochastic differential equations, respectively. We extend these NSGPs by including $L^1$-regularization on the processes in order to induce sparseness. To solve the resulting regularized NSGP (R-NSGP) regression problem we develop a method based on the alternating direction method of multipliers (ADMM) and we also analyze its convergence properties theoretically. We also evaluate the performance of the proposed methods in simulated and real-world datasets.
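As a toy illustration of how ADMM handles an $L^1$ penalty, the sketch below solves the scalar-wise problem of minimizing 0.5·||x − y||² + λ·||z||₁ subject to x = z. This is not the R-NSGP regression problem from the paper; the splitting, the penalty parameter `rho`, and the data are assumptions chosen so that the answer is known in closed form (soft-thresholding of `y`).

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(y, lam, rho=1.0, iters=200):
    """Scaled-form ADMM for min 0.5*||x - y||^2 + lam*||z||_1 s.t. x = z."""
    n = len(y)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # x-update: closed-form minimizer of the quadratic part.
        x = [(y[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: soft-thresholding is the prox of the L1 term.
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # Dual update: accumulate the constraint residual x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


y = [3.0, 0.4, -2.5]
z_admm = admm_l1_denoise(y, lam=1.0)
z_exact = [soft_threshold(v, 1.0) for v in y]
```

The small entry of `y` is driven exactly to zero while the large entries are shrunk by λ, which is the sparsity-inducing behavior that motivates $L^1$ regularization.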

MLFeb 5, 2021
Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels

Hao Wang, Rui Gao, Flavio P. Calmon

Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.

IVJan 11, 2021
Generalize Ultrasound Image Segmentation via Instant and Plug & Play Style Transfer

Zhendong Liu, Xiaoqiong Huang, Xin Yang et al.

Deep segmentation models that generalize to images with unknown appearance are important for real-world medical image analysis. Retraining models leads to high latency and complex pipelines, which are impractical in clinical settings. The situation becomes more severe for ultrasound image analysis because of their large appearance shifts. In this paper, we propose a novel method for robust segmentation under unknown appearance shifts. Our contribution is three-fold. First, we advance a one-stage plug-and-play solution by embedding hierarchical style transfer units into a segmentation architecture. Our solution can remove appearance shifts and perform segmentation simultaneously. Second, we adopt Dynamic Instance Normalization to conduct precise and dynamic style transfer in a learnable manner, rather than the previously used fixed style normalization. Third, our solution is fast and lightweight for routine clinical adoption. Given a 400×400 image input, our solution only needs an additional 0.2 ms and 1.92M FLOPs to handle appearance shifts compared to the baseline pipeline. Extensive experiments conducted on a large dataset from three vendors demonstrate that our proposed method enhances the robustness of deep segmentation models.

LGNov 8, 2020
Reliable Off-policy Evaluation for Reinforcement Learning

Jie Wang, Rui Gao, Hongyuan Zha

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or the inability to explore. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deployment of the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that with proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results are also generalized to batch reinforcement learning and are supported by empirical analysis.
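To make the off-policy setting concrete, here is a minimal sketch of the plain trajectory-wise importance-sampling estimator, which produces a point estimate only. It is not the distributionally robust estimator proposed in the paper; the two-action bandit, the policies, and the logged data are hypothetical.

```python
def importance_sampling_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Average of importance-weighted discounted returns over logged trajectories."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0    # product of per-step probability ratios
        ret = 0.0       # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


# Hypothetical one-step problem: action 1 pays reward 1, action 0 pays 0.
def behavior_prob(state, action):
    return 0.5  # logging policy chooses uniformly


def target_prob(state, action):
    return 0.8 if action == 1 else 0.2


logged = [[(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 1, 1.0)], [(0, 0, 0.0)]]
estimate = importance_sampling_estimate(logged, target_prob, behavior_prob)
```

With this balanced log, the estimate coincides with the target policy's true value of 0.8; with fewer or less balanced trajectories the estimate can vary widely, which is the uncertainty that confidence bounds are meant to quantify.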

MLOct 22, 2020
Two-sample Test using Projected Wasserstein Distance

Jie Wang, Rui Gao, Yao Xie

We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning: given two sets of samples, to determine whether they are from the same distribution. In particular, we aim to circumvent the curse of dimensionality in Wasserstein distance: when the dimension is high, it has diminishing testing power, which is inherently due to the slow concentration property of Wasserstein metrics in high-dimensional spaces. A key contribution is to couple the test with an optimal projection, that is, to find the low-dimensional linear mapping that maximizes the Wasserstein distance between the projected probability distributions. We characterize the theoretical property of the finite-sample convergence rate on IPMs and present practical algorithms for computing this metric. Numerical examples validate our theoretical results.
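The idea of maximizing the Wasserstein distance over projections can be sketched in two dimensions with a brute-force search over unit directions. This is an illustrative simplification, not the paper's algorithm: it assumes equal sample sizes, projects to one dimension only, and replaces the optimization with a grid over angles (`n_angles` and the sample points are hypothetical).

```python
import math


def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D empirical distributions (sorted matching)."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def projected_wasserstein_2d(X, Y, n_angles=180):
    """Maximize the 1D Wasserstein distance over a grid of unit directions."""
    best_w, best_theta = -1.0, 0.0
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        w = wasserstein_1d(
            [c * x0 + s * x1 for x0, x1 in X],
            [c * y0 + s * y1 for y0, y1 in Y],
        )
        if w > best_w:
            best_w, best_theta = w, theta
    return best_w, best_theta


# Two hypothetical samples that differ only along the second coordinate.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
Y = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
best_w, best_theta = projected_wasserstein_2d(X, Y)
```

The search recovers the direction along which the two samples actually differ (the second axis), while projections onto the first axis see no difference at all.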

IVOct 10, 2020
Contrastive Rendering for Ultrasound Image Segmentation

Haoming Li, Xin Yang, Jiamin Liang et al.

Ultrasound (US) image segmentation has improved significantly in the deep learning era. However, the lack of sharp boundaries in US images remains an inherent challenge for segmentation. Previous methods often resort to global context, multi-scale cues or auxiliary guidance to estimate the boundaries. It is hard for these methods to approach pixel-level learning for fine-grained boundary generation. In this paper, we propose a novel and effective framework to improve boundary estimation in US images. Our work has three highlights. First, we propose to formulate the boundary estimation as a rendering task, which can recognize ambiguous points (pixels/voxels) and calibrate the boundary prediction via enriched feature representation learning. Second, we introduce point-wise contrastive learning to enhance the similarity of points from the same class and contrastively decrease the similarity of points from different classes. Boundary ambiguities are therefore further addressed. Third, both rendering and contrastive learning tasks contribute to consistent improvement while reducing network parameters. As a proof-of-concept, we performed validation experiments on a challenging dataset of 86 ovarian US volumes. Results show that our proposed method outperforms state-of-the-art methods and has the potential to be used in clinical practice.

LGSep 9, 2020
Finite-Sample Guarantees for Wasserstein Distributionally Robust Optimization: Breaking the Curse of Dimensionality

Rui Gao

Wasserstein distributionally robust optimization (DRO) aims to find robust and generalizable solutions by hedging against data perturbations in Wasserstein distance. Despite its recent empirical success in operations research and machine learning, existing performance guarantees for generic loss functions are either overly conservative due to the curse of dimensionality, or plausible only in large sample asymptotics. In this paper, we develop a non-asymptotic framework for analyzing the out-of-sample performance for Wasserstein robust learning and the generalization bound for its related Lipschitz and gradient regularization problems. To the best of our knowledge, this gives the first finite-sample guarantee for generic Wasserstein DRO problems without suffering from the curse of dimensionality. Our results highlight that Wasserstein DRO, with a properly chosen radius, balances between the empirical mean of the loss and the variation of the loss, measured by the Lipschitz norm or the gradient norm of the loss. Our analysis is based on two novel methodological developments that are of independent interest: 1) a new concentration inequality controlling the decay rate of large deviation probabilities by the variation of the loss and, 2) a localized Rademacher complexity theory based on the variation of the loss.

MLJun 7, 2020
Distributionally Robust Weighted $k$-Nearest Neighbors

Shixiang Zhu, Liyan Xie, Minghe Zhang et al.

Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing $k$-nearest neighbor ($k$-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors, which aims to find the optimal weighted $k$-NN classifiers that hedge against feature uncertainties. We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem and assigns minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla $k$-NN, and thus improves the generalization capability. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments.
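The effect of per-sample weights on a $k$-NN vote can be shown with a small sketch. The weights below are hand-picked placeholders, not the minimax optimal weights that the paper derives, and the training points, labels, and query are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=3):
    """train is a list of (features, label, weight); returns the heaviest label."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for _, label, weight in nearest:
        votes[label] += weight  # each neighbor votes with its own weight
    return max(votes, key=votes.get)


# Hypothetical training set; one "b" sample carries a larger weight.
weighted_train = [
    ((0.0, 0.0), "a", 1.0),
    ((0.0, 1.0), "a", 1.0),
    ((1.0, 0.0), "b", 3.0),
    ((5.0, 5.0), "b", 1.0),
]
uniform_train = [(x, label, 1.0) for x, label, _ in weighted_train]

query = (0.4, 0.2)
weighted_label = weighted_knn_predict(weighted_train, query)
uniform_label = weighted_knn_predict(uniform_train, query)
```

With uniform weights the two nearby "a" samples win the vote, whereas the heavier "b" sample flips the decision, which shows why the choice of weights matters when only a few training samples are available.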

IVFeb 14, 2020
Remove Appearance Shift for Ultrasound Image Segmentation via Fast and Universal Style Transfer

Zhendong Liu, Xin Yang, Rui Gao et al.

Deep Neural Networks (DNNs) suffer from performance degradation when image appearance shift occurs, especially in ultrasound (US) image segmentation. In this paper, we propose a novel and intuitive framework to remove the appearance shift, and hence improve the generalization ability of DNNs. Our work has three highlights. First, we follow the spirit of universal style transfer to remove appearance shifts, which was not explored before for US images. Without sacrificing image structure details, it enables arbitrary style-content transfer. Second, accelerated with an Adaptive Instance Normalization block, our framework achieves the real-time speed required in clinical US scanning. Third, an efficient and effective style image selection strategy is proposed to ensure the target-style US image and testing content US image properly match each other. Experiments on two large US datasets demonstrate that our methods are superior to state-of-the-art methods in making DNNs robust against various appearance shifts.
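Adaptive Instance Normalization re-scales content features so that their per-channel mean and standard deviation match those of the style features. The sketch below applies that formula to a single channel stored as a flat list; it is a minimal illustration under that assumption, not the paper's network, and the feature values and `eps` are hypothetical.

```python
import math


def mean_std(values, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one feature channel."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var + eps)


def adain(content, style):
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), per channel."""
    mu_c, sd_c = mean_std(content)
    mu_s, sd_s = mean_std(style)
    return [sd_s * (c - mu_c) / sd_c + mu_s for c in content]


# Hypothetical single-channel features from a content and a style image.
content = [1.0, 2.0, 3.0, 4.0]
style = [10.0, 20.0, 30.0, 40.0]

out = adain(content, style)
out_mu, out_sd = mean_std(out)
style_mu, style_sd = mean_std(style)
```

The output keeps the relative ordering of the content features but carries the style statistics. Since the operation is only a mean and variance computation plus an affine map, it adds little cost at inference time.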

LGSep 28, 2019
Bridging Explicit and Implicit Deep Generative Models via Neural Stein Estimators

Qitian Wu, Rui Gao, Hongyuan Zha

There are two types of deep generative models: explicit and implicit. The former defines an explicit density form that allows likelihood inference, while the latter targets a flexible transformation from random noise to generated samples. While the two classes of generative models have shown great power in many applications, both of them, when used alone, suffer from respective limitations and drawbacks. To take full advantage of both models and enable mutual compensation, we propose a novel joint training framework that bridges an explicit (unnormalized) density estimator and an implicit sample generator via Stein discrepancy. We show that our method 1) induces novel mutual regularization via kernel Sobolev norm penalization and Moreau-Yosida regularization, and 2) stabilizes the training dynamics. Empirically, we demonstrate that the proposed method can help the density estimator identify data modes more accurately and guide the generator to output higher-quality samples, compared with training a single counterpart. The new approach also shows promising results when the training samples are contaminated or limited.

MLMay 27, 2018
Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

Rui Gao, Liyan Xie, Yao Xie et al.

We develop a novel, computationally efficient and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions, which are sets centered around the empirical distribution defined via the Wasserstein metric; thus our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such approximation renders a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of such approximation, with complexity that is independent of the dimension of the observation space and can be nearly sample-size-independent in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

LGDec 17, 2017
Wasserstein Distributionally Robust Optimization and Variation Regularization

Rui Gao, Xi Chen, Anton J. Kleywegt

Wasserstein distributionally robust optimization (DRO) has recently achieved empirical success for various applications in operations research and machine learning, owing partly to its regularization effect. Although the connection between Wasserstein DRO and regularization has been established in several settings, existing results often require restrictive assumptions, such as smoothness or convexity, that are not satisfied for many problems. In this paper, we develop a general theory on the variation regularization effect of the Wasserstein DRO, a new form of regularization that generalizes total-variation regularization, Lipschitz regularization and gradient regularization. Our results cover possibly non-convex and non-smooth losses and losses on non-Euclidean spaces. Examples include multi-item newsvendor, portfolio selection, linear prediction, neural networks, manifold learning, and intensity estimation for Poisson processes. As an application of our theory of variation regularization, we derive new generalization guarantees for adversarial robust learning.

CVApr 18, 2017
Image Fusion With Cosparse Analysis Operator

Rui Gao, Sergiy A. Vorobyov, Hong Zhao

The paper addresses the image fusion problem, where multiple images captured with different focus distances are to be combined into a higher quality all-in-focus image. Most current approaches for image fusion strongly rely on the unrealistic noise-free assumption used during the image acquisition, and then yield limited robustness in fusion processing. In our approach, we formulate the multi-focus image fusion problem in terms of an analysis sparse model, and simultaneously perform the restoration and fusion of multi-focus images. Based on this model, we propose an analysis operator learning method and define a novel fusion function to generate an all-in-focus image. Experimental evaluations confirm the effectiveness of the proposed fusion approach both visually and quantitatively, and show that our approach outperforms state-of-the-art fusion methods.

AIMar 15, 2014
Sensing Subjective Well-being from Social Media

Bibo Hao, Lin Li, Rui Gao et al.

Subjective Well-being (SWB), which refers to how people experience the quality of their lives, is of great use to public policy-makers as well as to economic and sociological research. Traditionally, the measurement of SWB relies on time-consuming and costly self-report questionnaires. Nowadays, people are motivated to share their experiences and feelings on social media, so we propose to sense SWB from the vast user-generated data on social media. By utilizing 1785 users' social media data with SWB labels, we train machine learning models that are able to "sense" individual SWB from users' social media. Our model, which attains state-of-the-art prediction accuracy, can then be used to identify the SWB of a large population of social media users in a timely manner and at very low cost.