Jia-Jie Zhu

h-index 52

24 papers

425 citations

Novelty 56%

AI Score 46

Ranked #59,899 of 205,806 authors (top 29%); #147 in OC (top 17%)

24 Papers

LG · Jul 11, 2022

Functional Generalized Empirical Likelihood Estimation for Conditional Moment Restrictions

Heiner Kremer, Jia-Jie Zhu, Krikamol Muandet et al.

Important problems in causal inference, economics, and, more generally, robust machine learning can be expressed as conditional moment restrictions, but estimation becomes challenging as it requires solving a continuum of unconditional moment restrictions. Previous works addressed this problem by extending the generalized method of moments (GMM) to continuum moment restrictions. In contrast, generalized empirical likelihood (GEL) provides a more general framework and has been shown to enjoy favorable small-sample properties compared to GMM-based estimators. To benefit from recent developments in machine learning, we provide a functional reformulation of GEL in which arbitrary models can be leveraged. Motivated by a dual formulation of the resulting infinite dimensional optimization problem, we devise a practical method and explore its asymptotic properties. Finally, we provide kernel- and neural network-based implementations of the estimator, which achieve state-of-the-art empirical performance on two conditional moment restriction problems.

45.5 · ML · Mar 30

LDDMM stochastic interpolants: an application to domain uncertainty quantification in hemodynamics

Sarah Katz, Francesco Romor, Jia-Jie Zhu et al.

We introduce a novel conditional stochastic interpolant framework for generative modeling of three-dimensional shapes. The method builds on a recent LDDMM-based registration approach to learn the conditional drift between geometries. By leveraging the resulting pull-back and push-forward operators, we extend this formulation beyond standard Cartesian grids to complex shapes and random variables defined on distinct domains. We present an application in the context of cardiovascular simulations, where aortic shapes are generated from an initial cohort of patients. The conditioning variable is a latent geometric representation defined by a set of centerline points and the radii of the corresponding inscribed spheres. This methodology facilitates both data augmentation for three-dimensional biomedical shapes, and the generation of random perturbations of controlled magnitude for a given shape. These capabilities are essential for quantifying the impact of domain uncertainties arising from medical image segmentation on the estimation of relevant biomarkers.

OC · Apr 27, 2023

Propagating Kernel Ambiguity Sets in Nonlinear Data-driven Dynamics Models

Jia-Jie Zhu

This paper provides answers to an open problem: given a nonlinear data-driven dynamical system model, e.g., kernel conditional mean embedding (CME) and Koopman operator, how can one propagate the ambiguity sets forward for multiple steps? This problem is the key to solving distributionally robust control and learning-based control of such learned system models under a data-distribution shift. Different from previous works that use either static ambiguity sets, e.g., fixed Wasserstein balls, or dynamic ambiguity sets under known piece-wise linear (or affine) dynamics, we propose an algorithm that exactly propagates ambiguity sets through nonlinear data-driven models using the Koopman operator and CME, via the kernel maximum mean discrepancy geometry. Through both theoretical and numerical analysis, we show that our kernel ambiguity sets are the natural geometric structure for the learned data-driven dynamical system models.
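As a rough illustration of the kernel maximum mean discrepancy (MMD) geometry mentioned in this abstract, the sketch below estimates the squared MMD between two samples and checks membership in an MMD ball. It is not the paper's propagation algorithm. The Gaussian kernel, the bandwidth, the biased V-statistic estimator, and the function names are all assumptions chosen for the example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def in_ambiguity_set(candidate, reference, radius, bandwidth=1.0):
    """Hypothetical helper: is `candidate` inside the MMD ball around `reference`?"""
    return mmd_squared(candidate, reference, bandwidth) <= radius ** 2


if __name__ == "__main__":
    nominal = [(0.0,), (1.0,)]
    shifted = [(5.0,), (6.0,)]
    print(mmd_squared(nominal, nominal), mmd_squared(nominal, shifted))
```

A kernel ambiguity set in this sense is the set of distributions whose embedding lies within a given MMD radius of a reference sample.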

ML · Oct 27, 2024

Kernel Approximation of Fisher-Rao Gradient Flows

Jia-Jie Zhu, Alexander Mielke

The purpose of this paper is to answer a few open questions at the interface of kernel methods and PDE gradient flows. Motivated by recent advances in machine learning, particularly in generative modeling and sampling, we present a rigorous investigation of Fisher-Rao and Wasserstein type gradient flows concerning their gradient structures, flow equations, and their kernel approximations. Specifically, we focus on the Fisher-Rao (also known as Hellinger) geometry and its various kernel-based approximations, developing a principled theoretical framework using tools from PDE gradient flows and optimal transport theory. We also provide a complete characterization of gradient flows in the maximum-mean discrepancy (MMD) space, with connections to existing learning and inference algorithms. Our analysis reveals precise theoretical insights linking Fisher-Rao flows, Stein flows, kernel discrepancies, and nonparametric regression. We then rigorously prove evolutionary $Γ$-convergence for kernel-approximated Fisher-Rao flows, providing theoretical guarantees beyond pointwise convergence. Finally, we analyze energy dissipation using the Helmholtz-Rayleigh principle, establishing important connections between classical theory in mechanics and modern machine learning practice. Our results provide a unified theoretical foundation for understanding and analyzing approximations of gradient flows in machine learning applications through a rigorous gradient flow and variational method perspective.

AP · Jan 28, 2025

Hellinger-Kantorovich Gradient Flows: Global Exponential Decay of Entropy Functionals

Alexander Mielke, Jia-Jie Zhu

We investigate a family of gradient flows of positive and probability measures, focusing on the Hellinger-Kantorovich (HK) geometry, which unifies the transport mechanism of Otto-Wasserstein and the birth-death mechanism of Hellinger (or Fisher-Rao). A central contribution is a complete characterization of global exponential decay behaviors of entropy functionals (e.g. KL, $χ^2$) under Otto-Wasserstein and Hellinger-type gradient flows. In particular, for the more challenging analysis of HK gradient flows on positive measures -- where the typical log-Sobolev arguments fail -- we develop a specialized shape-mass decomposition that enables new analysis results. Our approach also leverages the (Polyak-)Łojasiewicz-type functional inequalities and a careful extension of classical dissipation estimates. These findings provide a unified and complete theoretical framework for gradient flows and underpin applications in computational algorithms for statistical inference, optimization, and machine learning.

OC · Feb 29, 2024

Analysis of Kernel Mirror Prox for Measure Optimization

Pavel Dvurechensky, Jia-Jie Zhu

By choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term the Mixed Functional Nash Equilibrium (MFNE), that underlies several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. We model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow when the function space is chosen as a reproducing kernel Hilbert space (RKHS). As a discrete time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which uses a dual step in the RKHS, and a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the iteration counter. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for DRO robustness and convergence.
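To give a feel for the primal entropic mirror prox step named in this abstract, the sketch below runs the classical finite-dimensional mirror prox scheme with entropy mirror maps on a small matrix game. It is not the kernel mirror prox (KMP) algorithm itself: the RKHS dual step is replaced by a second simplex variable, and the payoff matrix, step size, and function names are assumptions made for the example.

```python
import math


def _mat_vec(A, y):
    return [sum(a * b for a, b in zip(row, y)) for row in A]


def _mat_t_vec(A, x):
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def _entropic_step(p, grad, eta):
    """Mirror step with the entropy mirror map: multiplicative update, then renormalize."""
    w = [pi * math.exp(-eta * g) for pi, g in zip(p, grad)]
    total = sum(w)
    return [wi / total for wi in w]


def mirror_prox(A, eta=0.1, num_iters=2000):
    """Entropic mirror prox for min_x max_y x^T A y over probability simplices.

    Returns the averaged extrapolation points and their duality gap.
    """
    n, m = len(A), len(A[0])
    x, y = [1.0 / n] * n, [1.0 / m] * m
    x_avg, y_avg = [0.0] * n, [0.0] * m
    for _ in range(num_iters):
        # Extrapolation step: gradients evaluated at the current point.
        x_half = _entropic_step(x, _mat_vec(A, y), eta)
        y_half = _entropic_step(y, [-g for g in _mat_t_vec(A, x)], eta)
        # Update step: start again from the current point, but use the
        # gradients evaluated at the extrapolated point.
        x_new = _entropic_step(x, _mat_vec(A, y_half), eta)
        y_new = _entropic_step(y, [-g for g in _mat_t_vec(A, x_half)], eta)
        x, y = x_new, y_new
        x_avg = [s + v / num_iters for s, v in zip(x_avg, x_half)]
        y_avg = [s + v / num_iters for s, v in zip(y_avg, y_half)]
    gap = max(_mat_t_vec(A, x_avg)) - min(_mat_vec(A, y_avg))
    return x_avg, y_avg, gap


if __name__ == "__main__":
    payoff = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical 2x2 game
    print(mirror_prox(payoff))
```

The duality gap of the averaged iterates is the quantity that the $O(1/N)$ rate refers to in the deterministic finite-dimensional setting.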

OC · Feb 8, 2024

An Inexact Halpern Iteration with Application to Distributionally Robust Optimization

Ling Liang, Zusen Xu, Kim-Chuan Toh et al.

The Halpern iteration for solving monotone inclusion problems has gained increasing interest in recent years due to its simple form and appealing convergence properties. In this paper, we investigate the inexact variants of the scheme in both deterministic and stochastic settings. We conduct extensive convergence analysis and show that by choosing the inexactness tolerances appropriately, the inexact schemes admit an $O(k^{-1})$ convergence rate in terms of the (expected) residue norm. Our results relax the state-of-the-art inexactness conditions employed in the literature while sharing the same competitive convergence properties. We then demonstrate how the proposed methods can be applied for solving two classes of data-driven Wasserstein distributionally robust optimization problems that admit convex-concave min-max optimization reformulations. We highlight its capability of performing inexact computations for distributionally robust learning with stochastic first-order methods and for general nonlinear convex-concave loss functions, which are competitive in the literature.
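For readers unfamiliar with the scheme, here is a minimal sketch of the classical exact Halpern iteration for a nonexpansive map, using the common anchoring weights $1/(k+2)$. It does not implement the inexact or stochastic variants analyzed in the paper. The test map (a 90-degree rotation, whose only fixed point is the origin) and the function names are assumptions for illustration.

```python
import math


def halpern(T, x0, num_iters):
    """Exact Halpern iteration x_{k+1} = b_k * x0 + (1 - b_k) * T(x_k), b_k = 1/(k+2)."""
    x = list(x0)
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        tx = T(x)
        # Pull the iterate back toward the anchor x0 with a vanishing weight.
        x = [beta * a + (1.0 - beta) * t for a, t in zip(x0, tx)]
    return x


def residual_norm(T, x):
    """Fixed-point residual ||x - T(x)||."""
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(x, T(x))))


def rotate_90(x):
    """Nonexpansive map (a rotation) whose unique fixed point is the origin."""
    return [-x[1], x[0]]


if __name__ == "__main__":
    start = [1.0, 0.0]
    print(residual_norm(rotate_90, start))
    print(residual_norm(rotate_90, halpern(rotate_90, start, 1000)))
```

On this example, repeatedly applying the rotation alone never reduces the residual, while the anchored iterates shrink it at a rate proportional to $1/k$.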

ML · Oct 31, 2024

Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Jia-Jie Zhu

Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for the inclusive KL inference, i.e., minimizing $ \mathrm{KL}(π\| μ) $ with respect to $ μ$ for some target $ π$, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.

OC · Sep 29, 2025

Improved Stochastic Optimization of LogSumExp

Egor Gladin, Alexey Kroshnin, Jia-Jie Zhu et al.

The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. The approximation error is controlled by a tunable parameter and can be made arbitrarily small. Like the LogSumExp, our approximation preserves convexity. Moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
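As background for the numerical issues mentioned here, the sketch below shows the standard LogSumExp with the usual max-shift stabilization, together with its gradient, which is the softmax of the inputs. It does not implement the paper's safe KL approximation. The function names are chosen for the example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) using the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    """Gradient of logsumexp with respect to its inputs."""
    lse = logsumexp(values)
    return [math.exp(v - lse) for v in values]


if __name__ == "__main__":
    # A naive math.exp(1000.0) would overflow; the shifted version does not.
    print(logsumexp([1000.0, 1000.0]))
    print(softmax([1.0, 2.0, 3.0]))
```

Every softmax weight depends on all inputs through the normalizer, which is why the exact gradient touches every term in the sum.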

LG · May 18, 2023

Estimation Beyond Data Reweighting: Kernel Method of Moments

Heiner Kremer, Yassine Nemmour, Bernhard Schölkopf et al.

Moment restrictions and their conditional counterparts emerge in many areas of machine learning and statistics ranging from causal inference to reinforcement learning. Estimators for these tasks, generally called methods of moments, include the prominent generalized method of moments (GMM) which has recently gained attention in causal inference. GMM is a special case of the broader family of empirical likelihood estimators which are based on approximating a population distribution by means of minimizing a $\varphi$-divergence to an empirical distribution. However, the use of $\varphi$-divergences effectively limits the candidate distributions to reweightings of the data samples. We lift this long-standing limitation and provide a method of moments that goes beyond data reweighting. This is achieved by defining an empirical likelihood estimator based on maximum mean discrepancy which we term the kernel method of moments (KMM). We provide a variant of our estimator for conditional moment restrictions and show that it is asymptotically first-order optimal for such problems. Finally, we show that our method achieves competitive performance on several conditional moment restriction tasks.

LG · Jun 24, 2021

Shallow Representation is Deep: Learning Uncertainty-aware and Worst-case Random Feature Dynamics

Diego Agudelo-España, Yassine Nemmour, Bernhard Schölkopf et al.

Random features is a powerful universal function approximator that inherits the theoretical rigor of kernel methods and can scale up to modern learning tasks. This paper views uncertain system models as unknown or uncertain smooth functions in universal reproducing kernel Hilbert spaces. By directly approximating the one-step dynamics function using random features with uncertain parameters, which are equivalent to a shallow Bayesian neural network, we then view the whole dynamical system as a multi-layer neural network. Exploiting the structure of Hamiltonian dynamics, we show that finding worst-case dynamics realizations using Pontryagin's minimum principle is equivalent to performing the Frank-Wolfe algorithm on the deep net. Various numerical experiments on dynamics learning showcase the capacity of our modeling methodology.
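The random-feature idea can be sketched with random Fourier features for a Gaussian kernel: an inner product of randomized cosine features approximates the kernel value. This illustrates only the generic approximation, not the paper's uncertainty-aware dynamics model. The feature count, bandwidth, seed, and function names are assumptions for the example.

```python
import math
import random


def make_random_features(dim, num_features, bandwidth=1.0, seed=0):
    """Return a feature map z with z(x) . z(y) approximating a Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from the spectral density of the Gaussian kernel.
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(freq, x)) + b)
                for freq, b in zip(freqs, phases)]

    return feature_map


def gaussian_kernel(x, y, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


if __name__ == "__main__":
    phi = make_random_features(dim=2, num_features=5000)
    x, y = [0.3, -0.2], [0.5, 0.1]
    approx = sum(a * b for a, b in zip(phi(x), phi(y)))
    print(approx, gaussian_kernel(x, y))
```

A linear model on top of `feature_map` is a one-hidden-layer network with fixed random first-layer weights, which is the sense in which such a model is shallow.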

SY · Mar 29, 2021

Distributionally Robust Trajectory Optimization Under Uncertain Dynamics via Relative Entropy Trust-Regions

Hany Abdulsamad, Tim Dorau, Boris Belousov et al.

Trajectory optimization and model predictive control are essential techniques underpinning advanced robotic applications, ranging from autonomous driving to full-body humanoid control. State-of-the-art algorithms have focused on data-driven approaches that infer the system dynamics online and incorporate posterior uncertainty during planning and control. Despite their success, such approaches are still susceptible to catastrophic errors that may arise due to statistical learning biases, unmodeled disturbances, or even directed adversarial attacks. In this paper, we tackle the problem of dynamics mismatch and propose a distributionally robust optimal control formulation that alternates between two relative entropy trust-region optimization problems. Our method finds the worst-case maximum entropy Gaussian posterior over the dynamics parameters and the corresponding robust policy. Furthermore, we show that our approach admits a closed-form backward-pass for a certain class of systems. Finally, we demonstrate the resulting robustness on linear and nonlinear numerical examples.

LG · Feb 16, 2021

Adversarially Robust Kernel Smoothing

Jia-Jie Zhu, Christina Kouridi, Yassine Nemmour et al.

We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.

OC · Jun 12, 2020

Kernel Distributionally Robust Optimization

Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl et al.

We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.

OC · Mar 31, 2020

Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem

Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl et al.

In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods -- the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associated reproducing kernel Hilbert space in a nonparametric manner. We then present the tractable approximation and its theoretical justification. As a concrete application, we numerically test the proposed method in characterizing the worst-case constraint violation probability in the context of a constrained stochastic control system.

OC · Jan 28, 2020

A Kernel Mean Embedding Approach to Reducing Conservativeness in Stochastic Programming and Control

Jia-Jie Zhu, Moritz Diehl, Bernhard Schölkopf

We apply kernel mean embedding methods to sample-based stochastic optimization and control. Specifically, we use the reduced-set expansion method as a way to discard sampled scenarios. The effect of such constraint removal is improved optimality and decreased conservativeness. This is achieved by solving a distributional-distance-regularized optimization problem. We demonstrate that this optimization formulation is well-motivated in theory, computationally tractable, and effective in numerical algorithms.

ML · Nov 25, 2019

A New Distribution-Free Concept for Representing, Comparing, and Propagating Uncertainty in Dynamical Systems with Kernel Probabilistic Programming

Jia-Jie Zhu, Krikamol Muandet, Moritz Diehl et al.

This work presents the concept of kernel mean embedding and kernel probabilistic programming in the context of stochastic systems. We propose formulations to represent, compare, and propagate uncertainties for fairly general stochastic dynamics in a distribution-free manner. The new tools enjoy sound theory rooted in functional analysis and wide applicability as demonstrated in distinct numerical examples. The implication of this new concept is a new mode of thinking about the statistical nature of uncertainty in dynamical systems.

OC · Nov 20, 2019

Fast Non-Parametric Learning to Accelerate Mixed-Integer Programming for Online Hybrid Model Predictive Control

Jia-Jie Zhu, Georg Martius

Today's fast linear algebra and numerical optimization tools have pushed the frontier of model predictive control (MPC) forward, to the efficient control of highly nonlinear and hybrid systems. The field of hybrid MPC has demonstrated that the exact optimal control law can be computed, e.g., by mixed-integer programming (MIP) under piecewise-affine (PWA) system models. Despite the elegant theory, solving hybrid MPC online is still out of reach for many applications. We aim to speed up MIP by combining geometric insights from hybrid MPC, a simple-yet-effective learning algorithm, and MIP warm start techniques. Following a line of work in approximate explicit MPC, the proposed learning-control algorithm, LNMS, gains computational advantage over MIP at little cost and is straightforward for practitioners to implement.

RO · Jul 10, 2019

Robust Humanoid Locomotion Using Trajectory Optimization and Sample-Efficient Learning

Mohammad Hasan Yeganegi, Majid Khadiv, S. Ali A. Moosavian et al.

Trajectory optimization (TO) is one of the most powerful tools for generating feasible motions for humanoid robots. However, including uncertainties and stochasticity in the TO problem to generate robust motions can easily lead to intractable problems. Furthermore, since the models used in TO always have some level of abstraction, it can be hard to find a realistic set of uncertainties in the model space. In this paper we leverage a sample-efficient learning technique (Bayesian optimization) to robustify TO for humanoid locomotion. The main idea is to use data from full-body simulations to make the TO stage robust by tuning the cost weights. To this end, we split the TO problem into two phases. The first phase solves a convex optimization problem for generating center of mass (CoM) trajectories based on simplified linear dynamics. The second phase employs iterative Linear-Quadratic Gaussian (iLQG) as a whole-body controller to generate full-body control inputs. Then we use Bayesian optimization to find the cost weights to use in the first phase that yield robust performance in the simulation/experiment, in the presence of different disturbances/uncertainties. The results show that the proposed approach is able to generate robust motions for different sets of disturbances and uncertainties.

LG · Jun 19, 2019

Control What You Can: Intrinsically Motivated Task-Planning Agent

Sebastian Blaes, Marin Vlastelica Pogančić, Jia-Jie Zhu et al.

We present a novel intrinsically motivated agent that learns how to control the environment in the fastest possible manner by optimizing learning progress. It learns what can be controlled, how to allocate time and attention, and the relations between objects using surprise-based motivation. The effectiveness of our method is demonstrated in a synthetic as well as a robotic manipulation environment, yielding considerably improved performance and smaller sample complexity. In a nutshell, our work combines several task-level planning agent structures (backtracking search on task graph, probabilistic road-maps, allocation of search efforts) with intrinsic motivation to achieve learning from scratch.

RO · Jun 9, 2019

Trajectory Optimization for Robust Humanoid Locomotion with Sample-Efficient Learning

Majid Khadiv, Mohammad Hasan Yeganegi, S. Ali A. Moosavian et al.

NA · Apr 6, 2019

Projection Algorithms for Non-Convex Minimization with Application to Sparse Principal Component Analysis

William W. Hager, Dzung T. Phan, Jia-Jie Zhu

We consider concave minimization problems over non-convex sets. Optimization problems with this structure arise in sparse principal component analysis. We analyze both a gradient projection algorithm and an approximate Newton algorithm where the Hessian approximation is a multiple of the identity. Convergence results are established. In numerical experiments arising in sparse principal component analysis, it is seen that the performance of the gradient projection algorithm is very similar to that of the truncated power method and the generalized power method. In some cases, the approximate Newton algorithm with a Barzilai-Borwein (BB) Hessian approximation can be substantially faster than the other algorithms, and can converge to a better solution.
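The truncated power method used as a baseline in this abstract can be sketched as follows: multiply by the covariance matrix, keep the k largest-magnitude entries, and renormalize. The truncate-and-normalize step is the Euclidean projection onto k-sparse unit vectors. This is an illustrative sketch, not the paper's gradient projection or approximate Newton algorithm. The example matrix and function names are assumptions, and the covariance is assumed nonzero and positive semidefinite.

```python
import math


def project_sparse_unit(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, normalize to unit length."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    truncated = [v[i] if i in keep else 0.0 for i in range(len(v))]
    norm = math.sqrt(sum(t * t for t in truncated))
    return [t / norm for t in truncated]  # assumes v is not the zero vector


def truncated_power_iteration(cov, k, num_iters=100):
    """Approximate a k-sparse leading eigenvector of a PSD matrix `cov`."""
    n = len(cov)
    x = project_sparse_unit([1.0] * n, n)  # dense uniform starting point
    for _ in range(num_iters):
        v = [sum(c * xi for c, xi in zip(row, x)) for row in cov]
        x = project_sparse_unit(v, k)
    return x


if __name__ == "__main__":
    cov = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical covariance
    print(truncated_power_iteration(cov, k=2))
```

Because the feasible set of sparse unit vectors is non-convex, the iterate reached can depend on the starting point.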

SY · Sep 13, 2018

Deep Reinforcement Learning for Event-Triggered Control

Dominik Baumann, Jia-Jie Zhu, Georg Martius et al.

Event-triggered control (ETC) methods can achieve high-performance control with a significantly lower number of samples compared to usual, time-triggered methods. These frameworks are often based on a mathematical model of the system and specific designs of controller and event trigger. In this paper, we show how deep reinforcement learning (DRL) algorithms can be leveraged to simultaneously learn control and communication behavior from scratch, and present a DRL approach that is particularly suitable for ETC. To our knowledge, this is the first work to apply DRL to ETC. We validate the approach on multiple control tasks and compare it to model-based event-triggering frameworks. In particular, we demonstrate that, unlike many model-based ETC designs, it can be straightforwardly applied to nonlinear systems.

LG · Feb 25, 2017

Generative Adversarial Active Learning

Jia-Jie Zhu, José Bento

We propose a new active learning by query synthesis approach using Generative Adversarial Networks (GAN). Different from regular active learning, the resulting algorithm adaptively synthesizes training instances for querying to increase learning speed. We generate queries according to the uncertainty principle, but our idea can work with other active learning principles. We report results from various numerical experiments to demonstrate the effectiveness of the proposed approach. In some settings, the proposed algorithm outperforms traditional pool-based approaches. To the best of our knowledge, this is the first active learning work using GAN.