NADec 17, 2012
Sparse Dynamics for Partial Differential EquationsHayden Schaeffer, Stanley Osher, Russel Caflisch et al.
We investigate the approximate dynamics of several differential equations when the solutions are restricted to a sparse subset of a given basis. The restriction is enforced at every time step by simply applying soft thresholding to the coefficients of the basis approximation. By reducing or compressing the information needed to represent the solution at every step, only the essential dynamics are represented. In many cases, there are natural bases derived from the differential equations which promote sparsity. We find that our method successfully reduces the dynamics of convection equations, diffusion equations, weak shocks, and vorticity equations with high frequency source terms.
NANov 28, 2018
Computations of optimal transport distance with Fisher information regularizationWuchen Li, Penghang Yin, Stanley Osher
We propose a fast algorithm to approximate the optimal transport distance. The main idea is to add a Fisher information regularization into the dynamical setting of the problem, originated by Benamou and Brenier. The regularized problem is shown to be smooth and strictly convex, thus many classical fast algorithms are available. In this paper, we adopt Newton's method, which converges to the minimizer with a quadratic rate. Several numerical examples are provided.
OCJun 17, 2018
Vector and Matrix Optimal Mass Transport: Theory, Algorithm, and ApplicationsErnest K. Ryu, Yongxin Chen, Wuchen Li et al.
In many applications such as color image processing, data has more than one piece of information associated with each spatial coordinate, and in such cases the classical optimal mass transport (OMT) must be generalized to handle vector-valued or matrix-valued densities. In this paper, we discuss the vector and matrix optimal mass transport and present three contributions. We first present a rigorous mathematical formulation for these setups and provide analytical results including existence of solutions and strong duality. Next, we present a simple, scalable, and parallelizable methods to solve the vector and matrix-OMT problems. Finally, we implement the proposed methods on a CUDA GPU and present experiments and applications.
NAMay 4, 2018
Algorithm for Hamilton-Jacobi equations in density space via a generalized Hopf formulaYat Tin Chow, Wuchen Li, Stanley Osher et al.
We design fast numerical methods for Hamilton-Jacobi equations in density space (HJD), which arises in optimal transport and mean field games. We overcome the curse-of-infinite-dimensionality nature of HJD by proposing a generalized Hopf formula in density space. The formula transfers optimal control problems in density space, which are constrained minimizations supported on both spatial and time variables, to optimization problems over only spatial variables. This transformation allows us to compute HJD efficiently via multi-level approaches and coordinate descent methods.
OCDec 16, 2017
Time-Optimal Collaborative Guidance Using the Generalized Hopf FormulaMatthew R. Kirchner, Robert Mar, Gary Hewer et al.
Presented is a new method for calculating the time-optimal guidance control for a multiple vehicle pursuit-evasion system. A joint differential game of k pursuing vehicles relative to the evader is constructed, and a Hamilton-Jacobi-Isaacs (HJI) equation that describes the evolution of the value function is formulated. The value function is built such that the terminal cost is the squared distance from the boundary of the terminal surface. Additionally, all vehicles are assumed to have bounded controls. Typically, a joint state space constructed in this way would have too large a dimension to be solved with existing grid-based approaches. The value function is computed efficiently in high-dimensional space, without a discrete grid, using the generalized Hopf formula. The optimal time-to-reach is iteratively solved, and the optimal control is inferred from the gradient of the value function.
NADec 14, 2016
A fast algorithm for Earth Mover's Distance based on optimal transport and L1 type RegularizationWuchen Li, Stanley Osher, Wilfrid Gangbo
We propose a new algorithm to approximate the Earth Mover's distance (EMD). Our main idea is motivated by the theory of optimal transport, in which EMD can be reformulated as a familiar $L_1$ type minimization. We use a regularization which gives us a unique solution for this $L_1$ type problem. The new regularized minimization is very similar to problems which have been solved in the fields of compressed sensing and image processing, where several fast methods are available. In this paper, we adopt a primal-dual algorithm designed there, which uses very simple updates at each iteration and is shown to converge very rapidly. Several numerical examples are provided.
LGNov 28, 2022
Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci CurvatureKhang Nguyen, Hieu Nong, Vinh Nguyen et al.
Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.
SYJun 13, 2018
A Primal-Dual Method for Optimal Control and Trajectory Generation in High-Dimensional SystemsMatthew R. Kirchner, Gary Hewer, Jerome Darbon et al.
Presented is a method for efficient computation of the Hamilton-Jacobi (HJ) equation for time-optimal control problems using the generalized Hopf formula. Typically, numerical methods to solve the HJ equation rely on a discrete grid of the solution space and exhibit exponential scaling with dimension. The generalized Hopf formula avoids the use of grids and numerical gradients by formulating an unconstrained convex optimization problem. The solution at each point is completely independent, and allows a massively parallel implementation if solutions at multiple points are desired. This work presents a primal-dual method for efficient numeric solution and presents how the resulting optimal trajectory can be generated directly from the solution of the Hopf formula, without further optimization. Examples presented have execution times on the order of milliseconds and experiments show computation scales approximately polynomial in dimension with very small high-order coefficients.
OCMay 18, 2022
Neural ODE Control for Trajectory Approximation of Continuity EquationKarthik Elamvazhuthi, Bahman Gharesifard, Andrea Bertozzi et al.
We consider the controllability problem for the continuity equation, corresponding to neural ordinary differential equations (ODEs), which describes how a probability measure is pushedforward by the flow. We show that the controlled continuity equation has very strong controllability properties. Particularly, a given solution of the continuity equation corresponding to a bounded Lipschitz vector field defines a trajectory on the set of probability measures. For this trajectory, we show that there exist piecewise constant training weights for a neural ODE such that the solution of the continuity equation corresponding to the neural ODE is arbitrarily close to it. As a corollary to this result, we establish that the continuity equation of the neural ODE is approximately controllable on the set of compactly supported probability measures that are absolutely continuous with respect to the Lebesgue measure.
OCApr 1, 2011
A nonlinear PDE-based method for sparse deconvolutionYu Mao, Bin Dong, Stanley Osher
In this paper, we introduce a new nonlinear evolution partial differential equation for sparse deconvolution problems. The proposed PDE has the form of continuity equation that arises in various research areas, e.g. fluid dynamics and optimal transportation, and thus has some interesting physical and geometric interpretations. The underlying optimization model that we consider is the standard $\ell_1$ minimization with linear equality constraints, i.e. $\min_u\{\|u\|_1 : Au=f\}$ with $A$ being an under-sampled convolution operator. We show that our PDE preserves the $\ell_1$ norm while lowering the residual $\|Au-f\|_2$. More importantly the solution of the PDE becomes sparser asymptotically, which is illustrated numerically. Therefore, it can be treated as a natural and helpful plug-in to some algorithms for $\ell_1$ minimization problems, e.g. Bregman iterative methods introduced for sparse reconstruction problems in [W. Yin, S. Osher, D. Goldfarb, and J. Darbon,SIAM J. Imaging Sci., 1 (2008), pp. 143-168]. Numerical experiments show great improvements in terms of both convergence speed and reconstruction quality.
OCMay 23, 2018
Solving Large-Scale Optimization Problems with a Convergence Rate Independent of Grid SizeMatt Jacobs, Flavien Léger, Wuchen Li et al.
We present a primal-dual method to solve L1-type non-smooth optimization problems independently of the grid size. We apply these results to two important problems : the Rudin-Osher-Fatemi image denoising model and the L1 earth mover's distance from optimal transport. Crucially, we provide analysis that determines the choice of optimal step sizes and we prove that our method converges independently of the grid size. Our approach allows us to solve these problems on grids as large as 4096 by 4096 in a few minutes without parallelization.
LGJan 1, 2023
Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced DataHien Dang, Tho Tran, Stanley Osher et al.
Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-layer features collapse to their class-means, and those class-means are the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is known as Neural Collapse (NC). Recent papers have theoretically shown that NC emerges in the global minimizers of training problems with the simplified "unconstrained feature model". In this context, we take a step further and prove the NC occurrences in deep linear networks for the popular mean squared error (MSE) and cross entropy (CE) losses, showing that global solutions exhibit NC properties across the linear layers. Furthermore, we extend our study to imbalanced data for MSE loss and present the first geometric analysis of NC under bias-free setting. Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors, whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures with both balanced and imbalanced scenarios.
OCNov 30, 2022
Taming Hyperparameter Tuning in Continuous Normalizing Flows Using the JKO SchemeAlexander Vidal, Samy Wu Fung, Luis Tenorio et al.
A normalizing flow (NF) is a mapping that transforms a chosen probability distribution to a normal distribution. Such flows are a common technique used for data generation and density estimation in machine learning and data science. The density estimate obtained with a NF requires a change of variables formula that involves the computation of the Jacobian determinant of the NF transformation. In order to tractably compute this determinant, continuous normalizing flows (CNF) estimate the mapping and its Jacobian determinant using a neural ODE. Optimal transport (OT) theory has been successfully used to assist in finding CNFs by formulating them as OT problems with a soft penalty for enforcing the standard normal distribution as a target measure. A drawback of OT-based CNFs is the addition of a hyperparameter, $α$, that controls the strength of the soft penalty and requires significant tuning. We present JKO-Flow, an algorithm to solve OT-based CNF without the need of tuning $α$. This is achieved by integrating the OT CNF framework into a Wasserstein gradient flow framework, also known as the JKO scheme. Instead of tuning $α$, we repeatedly solve the optimization problem for a fixed $α$ effectively performing a JKO update with a time-step $α$. Hence we obtain a "divide and conquer" algorithm by repeatedly solving simpler problems instead of solving a potentially harder problem with large $α$.
MLAug 28, 2023
Noise-Free Sampling Algorithms via Regularized Wasserstein ProximalsHong Ye Tan, Stanley Osher, Wuchen Li
We consider the problem of sampling from a distribution governed by a potential function. This work proposes an explicit score based MCMC method that is deterministic, resulting in a deterministic evolution for particles rather than a stochastic differential equation evolution. The score term is given in closed form by a regularized Wasserstein proximal, using a kernel convolution that is approximated by sampling. We demonstrate fast convergence on various problems and show improved dimensional dependence of mixing time bounds for the case of Gaussian distributions compared to the unadjusted Langevin algorithm (ULA) and the Metropolis-adjusted Langevin algorithm (MALA). We additionally derive closed form expressions for the distributions at each iterate for quadratic potential functions, characterizing the variance reduction. Empirical results demonstrate that the particles behave in an organized manner, lying on level set contours of the potential. Moreover, the posterior mean estimator of the proposed method is shown to be closer to the maximum a-posteriori estimator compared to ULA and MALA in the context of Bayesian logistic regression. Additional examples demonstrate competitive performance for Bayesian neural network training.
LGSep 29, 2022
Improving Generative Flow Networks with Path RegularizationAnh Do, Duy Dinh, Tan Nguyen et al.
Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNets. The prior is designed to help the GFlowNets better discover the latent structure of the target distribution or enhance its ability to explore the environment in the context of active learning. The path regularization controls the flow in GFlowNets to generate more diverse and novel candidates via maximizing the optimal transport distances between two forward policies or to improve the generalization via minimizing the optimal transport distances. In addition, we derive an efficient implementation of the regularization by finding its closed form solutions in specific cases and a meaningful upper bound that can be used as an approximation to minimize the regularization term. We empirically demonstrate the advantage of our path regularization on a wide range of tasks, including synthetic hypergrid environment modeling, discrete probabilistic modeling, and biological sequence design.
94.8OCApr 25
On the Convergence of Jacobian-Free Backpropagation for Optimal Control Problems with Implicit HamiltoniansEric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang et al.
Optimal feedback control with implicit Hamiltonians poses a fundamental challenge for learning-based value function methods due to the absence of closed-form optimal control laws. Recent work~\cite{gelphman2025end} introduced an implicit deep learning approach using Jacobian-Free Backpropagation (JFB) to address this setting, but only established sample-wise descent guarantees. In this paper, we establish convergence guarantees for JFB in the stochastic minibatch setting, showing that the resulting updates converge to stationary points of the expected optimal control objective. We further demonstrate scalability on substantially higher-dimensional problems, including multi-agent optimal consumption and swarm-based quadrotor and bicycle control. Together, our results provide both theoretical justification and empirical evidence for using JFB in high-dimensional optimal control with implicit Hamiltonians.
95.7LGMay 17
Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right SpaceKelvin Kan, Xingjian Li, Benjamin J. Zhang et al.
Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and is compatible with time-inhomogeneous schedules. Four novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes $S$-dependence under uniform transitions, and a score-marginal cancellation technique that removes $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models.
OCSep 24, 2024
Score-based Neural Ordinary Differential Equations for Computing Mean Field Control ProblemsMo Zhou, Stanley Osher, Wuchen Li
Classical neural ordinary differential equations (ODEs) are powerful tools for approximating the log-density functions in high-dimensional spaces along trajectories, where neural networks parameterize the velocity fields. This paper proposes a system of neural differential equations representing first- and second-order score functions along trajectories based on deep neural networks. We reformulate the mean field control (MFC) problem with individual noises into an unconstrained optimization problem framed by the proposed neural ODE system. Additionally, we introduce a novel regularization term to enforce characteristics of viscous Hamilton--Jacobi--Bellman (HJB) equations to be satisfied based on the evolution of the second-order score function. Examples include regularized Wasserstein proximal operators (RWPOs), probability flow matching of Fokker--Planck (FP) equations, and linear quadratic (LQ) MFC problems, which demonstrate the effectiveness and accuracy of the proposed method.
LGNov 25, 2024Code
VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics PredictionYadi Cao, Yuxuan Liu, Liu Yang et al.
In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON's adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.
LGFeb 3
SymPlex: A Structure-Aware Transformer for Symbolic PDE SolvingYesom Park, Annie C. Lu, Shao-Ching Huang et al.
We propose SymPlex, a reinforcement learning framework for discovering analytical symbolic solutions to partial differential equations (PDEs) without access to ground-truth expressions. SymPlex formulates symbolic PDE solving as tree-structured decision-making and optimizes candidate solutions using only the PDE and its boundary conditions. At its core is SymFormer, a structure-aware Transformer that models hierarchical symbolic dependencies via tree-relative self-attention and enforces syntactic validity through grammar-constrained autoregressive decoding, overcoming the limited expressivity of sequence-based generators. Unlike numerical and neural approaches that approximate solutions in discretized or implicit function spaces, SymPlex operates directly in symbolic expression space, enabling interpretable and human-readable solutions that naturally represent non-smooth behavior and explicit parametric dependence. Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods.
83.2OCMay 11
Fixed-Point Neural Optimal Transport without Implicit DifferentiationYesom Park, Eric Gelphman, Stanley Osher et al.
We propose an implicit neural formulation of optimal transport that eliminates adversarial min--max optimization and multi-network architectures commonly used in existing approaches. Our key idea is to parameterize a single potential in the Kantorovich dual and reformulate the associated c-transform as a proximal fixed-point problem. This yields a stable single-network framework in which dual feasibility is enforced exactly through proximal optimality conditions rather than adversarial training. Despite the inner fixed-point computation, gradients can be computed without differentiating through the fixed-point iterations, enabling efficient training without requiring implicit differentiation. We further establish convergence of stochastic gradient descent. The resulting framework is efficient, scalable, and broadly applicable: it simultaneously recovers forward and backward transport maps and naturally extends to class-conditional settings. Experiments on high-dimensional Gaussian benchmarks, physical datasets, and image translation tasks demonstrate strong transport accuracy together with improved training stability and favorable computational and memory efficiency.
LGNov 2, 2019Code
Laplacian Smoothing Stochastic Gradient Markov Chain Monte CarloBao Wang, Difan Zou, Quanquan Gu et al.
As an important Markov Chain Monte Carlo (MCMC) method, stochastic gradient Langevin dynamics (SGLD) algorithm has achieved great success in Bayesian learning and posterior sampling. However, SGLD typically suffers from slow convergence rate due to its large variance caused by the stochastic gradient. In order to alleviate these drawbacks, we leverage the recently developed Laplacian Smoothing (LS) technique and propose a Laplacian smoothing stochastic gradient Langevin dynamics (LS-SGLD) algorithm. We prove that for sampling from both log-concave and non-log-concave densities, LS-SGLD achieves strictly smaller discretization error in $2$-Wasserstein distance, although its mixing rate can be slightly slower. Experiments on both synthetic and real datasets verify our theoretical results, and demonstrate the superior performance of LS-SGLD on different machine learning tasks including posterior sampling, Bayesian logistic regression and training Bayesian convolutional neural networks. The code is available at \url{https://github.com/BaoWangMath/LS-MCMC}.
LGJun 17, 2018Code
Laplacian Smoothing Gradient DescentStanley Osher, Bao Wang, Penghang Yin et al.
We propose a class of very simple modifications of gradient descent and stochastic gradient descent. We show that when applied to a large variety of machine learning problems, ranging from logistic regression to deep neural nets, the proposed surrogates can dramatically reduce the variance, allow to take a larger step size, and improve the generalization accuracy. The methods only involve multiplying the usual (stochastic) gradient by the inverse of a positive definitive matrix (which can be computed efficiently by FFT) with a low condition number coming from a one-dimensional discrete Laplacian or its high order generalizations. It also preserves the mean and increases the smallest component and decreases the largest component. The theory of Hamilton-Jacobi partial differential equations demonstrates that the implicit version of the new algorithm is almost the same as doing gradient descent on a new function which (i) has the same global minima as the original function and (ii) is ``more convex". Moreover, we show that optimization algorithms with these surrogates converge uniformly in the discrete Sobolev $H_σ^p$ sense and reduce the optimality gap for convex optimization problems. The code is available at: \url{https://github.com/BaoWangMath/LaplacianSmoothing-GradientDescent}
FAOct 31, 2024
2D Empirical Transforms. Wavelets, Ridgelets and Curvelets revisitedJerome Gilles, Giang Tran, Stanley Osher
A recently developed new approach, called ``Empirical Wavelet Transform'', aims to build 1D adaptive wavelet frames accordingly to the analyzed signal. In this paper, we present several extensions of this approach to 2D signals (images). We revisit some well-known transforms (tensor wavelets, Littlewood-Paley wavelets, ridgelets and curvelets) and show that it is possible to build their empirical counterpart. We prove that such constructions lead to different adaptive frames which show some promising properties for image analysis and processing.
CVOct 30, 2024
Wavelet Burst Accumulation for turbulence mitigationJerome Gilles, Stanley Osher
In this paper, we investigate the extension of the recently proposed weighted Fourier burst accumulation (FBA) method into the wavelet domain. The purpose of FBA is to reconstruct a clean and sharp image from a sequence of blurred frames. This concept lies in the construction of weights to amplify dominant frequencies in the Fourier spectrum of each frame. The reconstructed image is then obtained by taking the inverse Fourier transform of the average of all processed spectra. In this paper, we first suggest to replace the rigid registration step used in the original algorithm by a non-rigid registration in order to be able to process sequences acquired through atmospheric turbulence. Second, we propose to work in a wavelet domain instead of the Fourier one. This leads us to the construction of two types of algorithms. Finally, we propose an alternative approach to replace the weighting idea by an approach promoting the sparsity in the used space. Several experiments are provided to illustrate the efficiency of the proposed methods.
CVNov 5, 2024
Fried deconvolutionJerome Gilles, Stanley Osher
In this paper we present a new approach to deblur the effect of atmospheric turbulence in the case of long range imaging. Our method is based on an analytical formulation, the Fried kernel, of the atmosphere modulation transfer function (MTF) and a framelet based deconvolution algorithm. An important parameter is the refractive index structure which requires specific measurements to be known. Then we propose a method which provides a good estimation of this parameter from the input blurred image. The final algorithms are very easy to implement and show very good results on both simulated blur and real images.
CVOct 30, 2024
Bregman implementation of Meyer's $G-$norm for cartoon + textures decompositionJerome Gilles, Stanley Osher
In this paper, we design a very simple algorithm based on Split Bregman iterations to numerically solve the cartoon + textures decomposition model of Meyer. This results in a significant gain in speed compared to Chambolle's nonlinear projectors.
NANov 9, 2024
A Natural Primal-Dual Hybrid Gradient Method for Adversarial Neural Network Training on Solving Partial Differential EquationsShu Liu, Stanley Osher, Wuchen Li
We propose a scalable preconditioned primal-dual hybrid gradient algorithm for solving partial differential equations (PDEs). We multiply the PDE with a dual test function to obtain an inf-sup problem whose loss functional involves lower-order differential operators. The Primal-Dual Hybrid Gradient (PDHG) algorithm is then leveraged for this saddle point problem. By introducing suitable precondition operators to the proximal steps in the PDHG algorithm, we obtain an alternative natural gradient ascent-descent optimization scheme for updating the neural network parameters. We apply the Krylov subspace method (MINRES) to evaluate the natural gradients efficiently. Such treatment readily handles the inversion of precondition matrices via matrix-vector multiplication. A posterior convergence analysis is established for the time-continuous version of the proposed method. The algorithm is tested on various types of PDEs with dimensions ranging from $1$ to $50$, including linear and nonlinear elliptic equations, reaction-diffusion equations, and Monge-Ampère equations stemming from the $L^2$ optimal transport problems. We compare the performance of the proposed method with several commonly used deep learning algorithms such as physics-informed neural networks (PINNs), the DeepRitz method, weak adversarial networks (WANs), etc, for solving PDEs using the Adam and L-BFGS optimizers. The numerical results suggest that the proposed method performs efficiently and robustly and converges more stably.
LGJan 31, 2025
Neural Implicit Solution Formula for Efficiently Solving Hamilton-Jacobi EquationsYesom Park, Stanley Osher
This paper presents an implicit solution formula for the Hamilton-Jacobi partial differential equation (HJ PDE). The formula is derived using the method of characteristics and is shown to coincide with the Hopf and Lax formulas in the case where either the Hamiltonian or the initial function is convex. It provides a simple and efficient numerical approach for computing the viscosity solution of HJ PDEs, bypassing the need for the Legendre transform of the Hamiltonian or the initial condition, and the explicit computation of individual characteristic trajectories. A deep learning-based methodology is proposed to learn this implicit solution formula, leveraging the mesh-free nature of deep learning to ensure scalability for high-dimensional problems. Building upon this framework, an algorithm is developed that approximates the characteristic curves piecewise linearly for state-dependent Hamiltonians. Extensive experimental results demonstrate that the proposed method delivers highly accurate solutions, even for nonconvex Hamiltonians, and exhibits remarkable scalability, achieving computational efficiency for problems up to 40 dimensions.
LGJan 30, 2025
OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport RegularizationKelvin Kan, Xingjian Li, Stanley Osher
Transformers have achieved state-of-the-art performance in numerous tasks. In this paper, we propose a continuous-time formulation of transformers. Specifically, we consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model. Moreover, we demonstrate in theory that this regularization is necessary as it promotes uniqueness and regularity of solutions. Our model is flexible in that almost any existing transformer architectures can be adopted to construct the dynamical system with only slight modifications to the existing code. We perform extensive numerical experiments on tasks motivated by natural language processing, image classification, and point cloud classification. Our experimental results show that the proposed method improves the performance of its discrete counterpart and outperforms relevant comparing models.
LGOct 10, 2025
Stability of Transformers under Layer NormalizationKelvin Kan, Xingjian Li, Benjamin J. Zhang et al.
Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.
OCSep 22, 2025
Zero-Shot Transferable Solution Method for Parametric Optimal Control ProblemsXingjian Li, Kelvin Kan, Deepanshu Verma et al.
This paper presents a transferable solution method for optimal control problems with varying objectives using function encoder (FE) policies. Traditional optimization-based approaches must be re-solved whenever objectives change, resulting in prohibitive computational costs for applications requiring frequent evaluation and adaptation. The proposed method learns a reusable set of neural basis functions that spans the control policy space, enabling efficient zero-shot adaptation to new tasks through either projection from data or direct mapping from problem specifications. The key idea is an offline-online decomposition: basis functions are learned once during offline imitation learning, while online adaptation requires only lightweight coefficient estimation. Numerical experiments across diverse dynamics, dimensions, and cost structures show our method delivers near-optimal performance with minimal overhead when generalizing across tasks, enabling semi-global feedback policies suitable for real-time deployment.
LOMay 31, 2025
Thinking Out of the Box: Hybrid SAT Solving by Unconstrained Continuous OptimizationZhiwei Zhang, Samy Wu Fung, Anastasios Kyrillidis et al.
The Boolean satisfiability (SAT) problem lies at the core of many applications in combinatorial optimization, software verification, cryptography, and machine learning. While state-of-the-art solvers have demonstrated high efficiency in handling conjunctive normal form (CNF) formulas, numerous applications require non-CNF (hybrid) constraints, such as XOR, cardinality, and Not-All-Equal constraints. Recent work leverages polynomial representations to represent such hybrid constraints, but it relies on box constraints that can limit the use of powerful unconstrained optimizers. In this paper, we propose unconstrained continuous optimization formulations for hybrid SAT solving by penalty terms. We provide theoretical insights into when these penalty terms are necessary and demonstrate empirically that unconstrained optimizers (e.g., Adam) can enhance SAT solving on hybrid benchmarks. Our results highlight the potential of combining continuous optimization and machine-learning-based methods for effective hybrid SAT solving.
TRJan 25, 2025
In-Context Operator Learning for Linear Propagator ModelsTingwei Meng, Moritz Voß, Nils Detering et al.
We study operator learning in the context of linear propagator models for optimal order execution problems with transient price impact à la Bouchaud et al. (2004) and Gatheral (2010). Transient price impact persists and decays over time according to some propagator kernel. Specifically, we propose to use In-Context Operator Networks (ICON), a novel transformer-based neural network architecture introduced by Yang et al. (2023), which facilitates data-driven learning of operators by merging offline pre-training with an online few-shot prompting inference. First, we train ICON to learn the operator from various propagator models that maps the trading rate to the induced transient price impact. The inference step is then based on in-context prediction, where ICON is presented only with a few examples. We illustrate that ICON is capable of accurately inferring the underlying price impact model from the data prompts, even with propagator kernels not seen in the training data. In a second step, we employ the pre-trained ICON model provided with context as a surrogate operator in solving an optimal order execution problem via a neural network control policy, and demonstrate that the exact optimal execution strategies from Abi Jaber and Neuman (2022) for the models generating the context are correctly retrieved. Our introduced methodology is very general, offering a new approach to solving optimal stochastic control problems with unknown state dynamics, inferred data-efficiently from a limited number of examples by leveraging the few-shot and transfer learning capabilities of transformer networks.
MLJan 14
Accelerated Regularized Wasserstein Proximal Sampling AlgorithmsHong Ye Tan, Stanley Osher, Wuchen Li
We consider sampling from a Gibbs distribution by evolving a finite number of particles using a particular score estimator rather than Brownian motion. To accelerate the particles, we consider a second-order score-based ODE, similar to Nesterov acceleration. In contrast to traditional kernel density score estimation, we use the recently proposed regularized Wasserstein proximal method, yielding the Accelerated Regularized Wasserstein Proximal method (ARWP). We provide a detailed analysis of continuous- and discrete-time non-asymptotic and asymptotic mixing rates for Gaussian initial and target distributions, using techniques from Euclidean acceleration and accelerated information gradients. Compared with the kinetic Langevin sampling algorithm, the proposed algorithm exhibits a higher contraction rate in the asymptotic time regime. Numerical experiments are conducted across various low-dimensional experiments, including multi-modal Gaussian mixtures and ill-conditioned Rosenbrock distributions. ARWP exhibits structured and convergent particles, accelerated discrete-time mixing, and faster tail exploration than the non-accelerated regularized Wasserstein proximal method and kinetic Langevin methods. Additionally, ARWP particles exhibit better generalization properties for some non-log-concave Bayesian neural network tasks.
LGNov 26, 2025
Dynamical Implicit Neural RepresentationsYesom Park, Kelvin Kan, Thomas Flynn et al.
Implicit Neural Representations (INRs) provide a powerful continuous framework for modeling complex visual and geometric signals, but spectral bias remains a fundamental challenge, limiting their ability to capture high-frequency details. Orthogonal to existing remedy strategies, we introduce Dynamical Implicit Neural Representations (DINR), a new INR modeling framework that treats feature evolution as a continuous-time dynamical system rather than a discrete stack of layers. This dynamical formulation mitigates spectral bias by enabling richer, more adaptive frequency representations through continuous feature evolution. Theoretical analysis based on Rademacher complexity and the Neural Tangent Kernel demonstrates that DINR enhances expressivity and improves training dynamics. Moreover, regularizing the complexity of the underlying dynamics provides a principled way to balance expressivity and generalization. Extensive experiments on image representation, field reconstruction, and data compression confirm that DINR delivers more stable convergence, higher signal fidelity, and stronger generalization than conventional static INRs.
LGOct 18, 2025
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ PriorFuqun Han, Stanley Osher, Wuchen Li
In this work, we propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. The design of the model is motivated by a special optimal transport problem, namely the regularized Wasserstein proximal operator, which admits a closed-form solution and turns out to be a special representation of transformer architectures. Compared with classical flow-based models, the proposed approach improves the convexity properties of the optimization problem and promotes sparsity in the generated samples. Through both theoretical analysis and numerical experiments, including applications in generative modeling and Bayesian inverse problems, we demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods.
LGOct 4, 2025
Implicit Models: Expressive Power Scales with Test-Time ComputeJialin Liu, Lisang Ding, Stanley Osher et al.
Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed larger explicit networks by allocating more test-time compute, the underlying mechanism remains poorly understood. We study this gap through a nonparametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process lets the model's expressive power scale with test-time compute, ultimately matching a much richer function class. The theory is validated across three domains: image reconstruction, scientific computing, and operations research, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.
OCOct 1, 2025
End-to-End Training of High-Dimensional Optimal Control with Implicit Hamiltonians via Jacobian-Free BackpropagationEric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang et al.
Neural network approaches that parameterize value functions have succeeded in approximating high-dimensional optimal feedback controllers when the Hamiltonian admits explicit formulas. However, many practical problems, such as the space shuttle reentry problem and bicycle dynamics, among others, may involve implicit Hamiltonians that do not admit explicit formulas, limiting the applicability of existing methods. Rather than directly parameterizing controls, which does not leverage the Hamiltonian's underlying structure, we propose an end-to-end implicit deep learning approach that directly parameterizes the value function to learn optimal control laws. Our method enforces physical principles by ensuring trained networks adhere to the control laws by exploiting the fundamental relationship between the optimal control and the value function's gradient; this is a direct consequence of the connection between Pontryagin's Maximum Principle and dynamic programming. Using Jacobian-Free Backpropagation (JFB), we achieve efficient training despite temporal coupling in trajectory optimization. We show that JFB produces descent directions for the optimal control objective and experimentally demonstrate that our approach effectively learns high-dimensional feedback controllers across multiple scenarios involving implicit Hamiltonians, which existing methods cannot address.
LGSep 30, 2025
Neural Hamilton--Jacobi Characteristic Flows for Optimal TransportYesom Park, Shu Liu, Mo Zhou et al.
We present a novel framework for solving optimal transport (OT) problems based on the Hamilton--Jacobi (HJ) equation, whose viscosity solution uniquely characterizes the OT map. By leveraging the method of characteristics, we derive closed-form, bidirectional transport maps, thereby eliminating the need for numerical integration. The proposed method adopts a pure minimization framework: a single neural network is trained with a loss function derived from the method of characteristics of the HJ equation. This design guarantees convergence to the optimal map while eliminating adversarial training stages, thereby substantially reducing computational complexity. Furthermore, the framework naturally extends to a wide class of cost functions and supports class-conditional transport. Extensive experiments on diverse datasets demonstrate the accuracy, scalability, and efficiency of the proposed method, establishing it as a principled and versatile tool for OT applications with provable optimality.
MLSep 1, 2025
Preconditioned Regularized Wasserstein Proximal SamplingHong Ye Tan, Stanley Osher, Wuchen Li
We consider sampling from a Gibbs distribution by evolving finitely many particles. We propose a preconditioned version of a recently proposed noise-free sampling method, governed by approximating the score function with the numerically tractable score of a regularized Wasserstein proximal operator. This is derived by a Cole--Hopf transformation on coupled anisotropic heat equations, yielding a kernel formulation for the preconditioned regularized Wasserstein proximal. The diffusion component of the proposed method is also interpreted as a modified self-attention block, as in transformer architectures. For quadratic potentials, we provide a discrete-time non-asymptotic convergence analysis and explicitly characterize the bias, which is dependent on regularization and independent of step-size. Experiments demonstrate acceleration and particle-level stability on various log-concave and non-log-concave toy examples to Bayesian total-variation regularized image deconvolution, and competitive/better performance on non-convex Bayesian neural network training when utilizing variable preconditioning matrices.
LGOct 13, 2021
How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case StudiesBao Wang, Hedi Xia, Tan Nguyen et al.
We present and review an algorithmic and theoretical framework for improving neural network architecture design via momentum. As case studies, we consider how momentum can improve the architecture design for recurrent neural networks (RNNs), neural ordinary differential equations (ODEs), and transformers. We show that integrating momentum into neural network architectures has several remarkable theoretical and empirical benefits, including 1) integrating momentum into RNNs and neural ODEs can overcome the vanishing gradient issues in training RNNs and neural ODEs, resulting in effective learning long-term dependencies. 2) momentum in neural ODEs can reduce the stiffness of the ODE dynamics, which significantly enhances the computational efficiency in training and testing. 3) momentum can improve the efficiency and accuracy of transformers.
LGJun 2, 2021
Operator Splitting for Learning to Predict Equilibria in Convex GamesDaniel McKenzie, Howard Heaton, Qiuwei Li et al.
Systems of competing agents can often be modeled as games. Assuming rationality, the most likely outcomes are given by an equilibrium (e.g. a Nash equilibrium). In many practical settings, games are influenced by context, i.e. additional data beyond the control of any agent (e.g. weather for traffic and fiscal policy for market economies). Often the exact game mechanics are unknown, yet vast amounts of historical data consisting of (context, equilibrium) pairs are available, raising the possibility of learning a solver which predicts the equilibria given only the context. We introduce Nash Fixed Point Networks (N-FPNs), a class of neural networks that naturally output equilibria. Crucially, N- FPNs employ a constraint decoupling scheme to handle complicated agent action sets while avoiding expensive projections. Empirically, we find N-FPNs are compatible with the recently developed Jacobian-Free Backpropagation technique for training implicit networks, making them significantly faster and easier to train than prior models. Our experiments show N-FPNs are capable of scaling to problems orders of magnitude larger than existing learned game solvers.
LGMar 23, 2021
JFB: Jacobian-Free Backpropagation for Implicit NetworksSamy Wu Fung, Howard Heaton, Qiuwei Li et al.
A promising trend in deep learning replaces traditional feedforward networks with implicit networks. Unlike traditional networks, implicit networks solve a fixed point equation to compute inferences. Solving for the fixed point varies in complexity, depending on provided data and an error tolerance. Importantly, implicit networks may be trained with fixed memory costs in stark contrast to feedforward networks, whose memory requirements scale linearly with depth. However, there is no free lunch -- backpropagation through implicit networks often requires solving a costly Jacobian-based equation arising from the implicit function theorem. We propose Jacobian-Free Backpropagation (JFB), a fixed-memory approach that circumvents the need to solve Jacobian-based equations. JFB makes implicit networks faster to train and significantly easier to implement, without sacrificing test accuracy. Our experiments show implicit networks trained with JFB are competitive with feedforward networks and prior implicit networks given the same number of parameters.
LGFeb 13, 2021
Wasserstein Proximal of GANsAlex Tong Lin, Wuchen Li, Stanley Osher et al.
We introduce a new method for training generative adversarial networks by applying the Wasserstein-2 metric proximal on the generators. The approach is based on Wasserstein information geometry. It defines a parametrization invariant natural gradient by pulling back optimal transport structures from probability space to parameter space. We obtain easy-to-implement iterative regularizers for the parameter updates of implicit deep generative models. Our experiments demonstrate that this method improves the speed and stability of training in terms of wall-clock time and Fréchet Inception Distance.
LGAug 5, 2020
Wasserstein-based Projections with Applications to Inverse ProblemsHoward Heaton, Samy Wu Fung, Alex Tong Lin et al.
Inverse problems consist of recovering a signal from a collection of noisy measurements. These are typically cast as optimization problems, with classic approaches using a data fidelity term and an analytic regularizer that stabilizes recovery. Recent Plug-and-Play (PnP) works propose replacing the operator for analytic regularization in optimization methods by a data-driven denoiser. These schemes obtain state of the art results, but at the cost of limited theoretical guarantees. To bridge this gap, we present a new algorithm that takes samples from the manifold of true data as input and outputs an approximation of the projection operator onto this manifold. Under standard assumptions, we prove this algorithm generates a learned operator, called Wasserstein-based projection (WP), that approximates the true projection with high probability. Thus, WPs can be inserted into optimization methods in the same manner as PnP, but now with theoretical guarantees. Provided numerical examples show WPs obtain state of the art results for unsupervised PnP signal recovery.
LGMay 1, 2020
Differentially Private Federated Learning with Laplacian SmoothingZhicong Liang, Bao Wang, Quanquan Gu et al.
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users. However, an adversary may still be able to infer the private training data by attacking the released model. Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models. In this paper, we investigate a utility enhancement scheme based on Laplacian smoothing for differentially private federated learning (DP-Fed-LS), where the parameter aggregation with injected Gaussian noise is improved in statistical precision without losing privacy budget. Our key observation is that the aggregated gradients in federated learning often enjoy a type of smoothness, i.e. sparsity in the graph Fourier basis with polynomial decays of Fourier coefficients as frequency grows, which can be exploited by the Laplacian smoothing efficiently. Under a prescribed differential privacy budget, convergence error bounds with tight rates are provided for DP-Fed-LS with uniform subsampling of heterogeneous Non-IID data, revealing possible utility improvement of Laplacian smoothing in effective dimensionality and variance reduction, among others. Experiments over MNIST, SVHN, and Shakespeare datasets show that the proposed method can improve model accuracy with DP-guarantee and membership privacy under both uniform and Poisson subsampling mechanisms.
LGDec 4, 2019
A Machine Learning Framework for Solving High-Dimensional Mean Field Game and Mean Field Control ProblemsLars Ruthotto, Stanley Osher, Wuchen Li et al.
Mean field games (MFG) and mean field control (MFC) are critical classes of multi-agent models for efficient analysis of massive populations of interacting agents. Their areas of application span topics in economics, finance, game theory, industrial engineering, crowd motion, and more. In this paper, we provide a flexible machine learning framework for the numerical solution of potential MFG and MFC models. State-of-the-art numerical methods for solving such problems utilize spatial discretization that leads to a curse-of-dimensionality. We approximately solve high-dimensional problems by combining Lagrangian and Eulerian viewpoints and leveraging recent advances from machine learning. More precisely, we work with a Lagrangian formulation of the problem and enforce the underlying Hamilton-Jacobi-Bellman (HJB) equation that is derived from the Eulerian formulation. Finally, a tailored neural network parameterization of the MFG/MFC solution helps us avoid any spatial discretization. Our numerical results include the approximate solution of 100-dimensional instances of optimal transport and crowd motion problems on a standard work station and a validation using an Eulerian solver in two dimensions. These results open the door to much-anticipated applications of MFG and MFC models that were beyond reach with existing numerical methods.
LGMar 13, 2019
Understanding Straight-Through Estimator in Training Activation Quantized Neural NetsPenghang Yin, Jiancheng Lyu, Shuai Zhang et al.
Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why searching in its negative direction minimizes the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modifed chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
MAFeb 6, 2019
Decentralized Multi-Agents by Imitation of a Centralized ControllerAlex Tong Lin, Mark J. Debord, Katia Estabridis et al.
We consider a multi-agent reinforcement learning problem where each agent seeks to maximize a shared reward while interacting with other agents, and they may or may not be able to communicate. Typically the agents do not have access to other agent policies and thus each agent is situated in a non-stationary and partially-observable environment. In order to obtain multi-agents that act in a decentralized manner, we introduce a novel algorithm under the popular framework of centralized training, but decentralized execution. This training framework first obtains solutions to a multi-agent problem with a single centralized joint-space learner, which is then used to guide imitation learning for independent decentralized multi-agents. This framework has the flexibility to use any reinforcement learning algorithm to obtain the expert as well as any imitation learning algorithm to obtain the decentralized agents. This is in contrast to other multi-agent learning algorithms that, for example, can require more specific structures. We present some theoretical bounds for our method, and we show that one can obtain decentralized solutions to a multi-agent problem through imitation learning.