Markos A. Katsoulakis

h-index32

38papers

527citations

Novelty57%

AI Score54

Ranked #25,893 of 201,326 authors (top 13%)#208 in ML (top 6%)

38 Papers

NASep 2, 2011

Multilevel coarse graining and nano--pattern discovery in many particle stochastic systems

Evangelia Kalligiannaki, Markos A. Katsoulakis, Petr Plechac et al.

In this work we propose a hierarchy of Monte Carlo methods for sampling equilibrium properties of stochastic lattice systems with competing short and long range interactions. Each Monte Carlo step is composed by two or more sub - steps efficiently coupling coarse and microscopic state spaces. The method can be designed to sample the exact or controlled-error approximations of the target distribution, providing information on levels of different resolutions, as well as at the microscopic level. In both strategies the method achieves significant reduction of the computational cost compared to conventional Markov Chain Monte Carlo methods. Applications in phase transition and pattern formation problems confirm the efficiency of the proposed methods.

NAJan 11, 2016

Path-space variational inference for non-equilibrium coarse-grained systems

Vagelis Harmandaris, Evangelia Kalligiannaki, Markos A. Katsoulakis et al.

In this paper, we discuss information-theoretic tools for obtaining optimized coarse-grained molecular models for both equilibrium and non-equilibrium molecular dynamics. The latter are ubiquitous in physicochemical and biological applications, where they are typically associated with coupling mechanisms, multi-physics and/or boundary conditions. In general the non-equilibrium steady states are not known explicitly as they do not necessarily have a Gibbs structure. The presented approach can compare microscopic behavior of molecular systems to parametric and non-parametric coarse-grained one using the relative entropy between distributions on the path space and setting up a corresponding path space variational inference problem. The methods can become entirely data-driven when the microscopic dynamics are replaced with corresponding correlated data in the form of time series. Furthermore, we present connections and generalizations of force matching methods in coarse-graining with path-space information methods, as well as demonstrate the enhanced transferability of information-based parameterizations to general observables due to information inequalities. We further discuss methodological connections between information-based coarse-graining of molecular systems and variational inference methods primarily developed in the machine learning community. However, we note that the work presented here addresses variational inference for correlated time series due to the focus on dynamics. The applicability of the proposed methods is demonstrated on high-dimensional stochastic processes given by Langevin, overdamped and driven Langevin dynamics of interacting particles.

NANov 27, 2011

Noise regularization and computations for the 1-dimensional stochastic Allen-Cahn problem

Markos A. Katsoulakis, Georgios T. Kossioris, Omar Lakkis

We address the numerical discretization of the Allen-Cahn prob- lem with additive white noise in one-dimensional space. The discretization is conducted in two stages: (1) regularize the white noise and study the regularized problem, (2) approximate the regularized problem. We address (1) by introducing a piecewise constant random approximation of the white noise with respect to a space-time mesh. We analyze the regularized problem and study its relation to both the original problem and the deterministic Allen-Cahn problem. Step (2) is then performed leading to a practical Monte-Carlo method combined with a Finite Element-Implicit Euler scheme. The resulting numerical scheme is tested against theoretical benchmark results.

MLApr 26, 2023

A mean-field games laboratory for generative modeling

Benjamin J. Zhang, Markos A. Katsoulakis

We demonstrate the versatility of mean-field games (MFGs) as a mathematical framework for explaining, enhancing, and designing generative models. In generative flows, a Lagrangian formulation is used where each particle (generated sample) aims to minimize a loss function over its simulated path. The loss, however, is dependent on the paths of other particles, which leads to a competition among the population of particles. The asymptotic behavior of this competition yields a mean-field game. We establish connections between MFGs and major classes of generative flows and diffusions including continuous-time normalizing flows, score-based generative models (SGM), and Wasserstein gradient flows. Furthermore, we study the mathematical properties of each generative model by studying their associated MFG's optimality condition, which is a set of coupled forward-backward nonlinear partial differential equations. The mathematical structure described by the MFG optimality conditions identifies the inductive biases of generative flows. We investigate the well-posedness and structure of normalizing flows, unravel the mathematical structure of SGMs, and derive a MFG formulation of Wasserstein gradient flows. From an algorithmic perspective, the optimality conditions yields Hamilton-Jacobi-Bellman (HJB) regularizers for enhanced training of generative models. In particular, we propose and demonstrate an HJB-regularized SGM with improved performance over standard SGMs. We present this framework as an MFG laboratory which serves as a platform for revealing new avenues of experimentation and invention of generative models.

MLOct 10, 2022

Function-space regularized Rényi divergences

Jeremiah Birrell, Yannis Pantazis, Paul Dupuis et al.

We propose a new family of regularized Rényi divergences parametrized not only by the order $α$ but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when $α>1$; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the $α\to\infty$ limit, which leads to a regularized worst-case-regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.

NAMay 24, 2011

Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms

Giorgos Arampatzis, Markos A. Katsoulakis, Petr Plechac et al.

We present a mathematical framework for constructing and analyzing parallel algorithms for lattice Kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. The algorithms can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). The proposed parallel algorithms are controlled-error approximations of kinetic Monte Carlo algorithms, departing from the predominant paradigm of creating parallel KMC algorithms with exactly the same master equation as the serial one. Our methodology relies on a spatial decomposition of the Markov operator underlying the KMC algorithm into a hierarchy of operators corresponding to the processors' structure in the parallel architecture. Based on this operator decomposition, we formulate Fractional Step Approximation schemes by employing the Trotter Theorem and its random variants; these schemes, (a) determine the communication schedule} between processors, and (b) are run independently on each processor through a serial KMC simulation, called a kernel, on each fractional step time-window. Furthermore, the proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules.

NAAug 5, 2012

Parallelization, processor communication and error analysis in lattice kinetic Monte Carlo

Giorgos Arampatzis, Markos A. Katsoulakis, Petr Plechac

In this paper we study from a numerical analysis perspective the Fractional Step Kinetic Monte Carlo (FS-KMC) algorithms proposed in [1] for the parallel simulation of spatially distributed particle systems on a lattice. FS-KMC are fractional step algorithms with a time-stepping window $Δt$, and as such they are inherently partially asynchronous since there is no processor communication during the period $Δt$. In this contribution we primarily focus on the error analysis of FS-KMC algorithms as approximations of conventional, serial kinetic Monte Carlo (KMC). A key aspect of our analysis relies on emphasising a goal-oriented approach for suitably defined macroscopic observables (e.g., density, energy, correlations, surface roughness), rather than focusing on strong topology estimates for individual trajectories. One of the key implications of our error analysis is that it allows us to address systematically the processor communication of different parallelization strategies for KMC by comparing their (partial) asynchrony, which in turn is measured by their respective fractional time step $Δt$ for a prescribed error tolerance.

NAFeb 23, 2016

Efficient estimators for likelihood ratio sensitivity indices of complex stochastic dynamics

Georgios Arampatzis, Markos A. Katsoulakis, Luc Rey-Bellet

We demonstrate that centered likelihood ratio estimators for the sensitivity indices of complex stochastic dynamics are highly efficient with low, constant in time variance and consequently they are suitable for sensitivity analysis in long-time and steady-state regimes. These estimators rely on a new covariance formulation of the likelihood ratio that includes as a submatrix a Fisher Information Matrix for stochastic dynamics and can also be used for fast screening of insensitive parameters and parameter combinations. The proposed methods are applicable to broad classes of stochastic dynamics such as chemical reaction networks, Langevin-type equations and stochastic models in finance, including systems with a high dimensional parameter space and/or disparate decorrelation times between different observables. Furthermore, they are simple to implement as a standard observable in any existing simulation algorithms without additional modifications.

NAAug 3, 2012

Spatial multi-level interacting particle simulations and information theory-based error quantification

Evangelia Kalligiannaki, Markos A. Katsoulakis, Petr Plechac

We propose a hierarchy of multi-level kinetic Monte Carlo methods for sampling high-dimensional, stochastic lattice particle dynamics with complex interactions. The method is based on the efficient coupling of different spatial resolution levels, taking advantage of the low sampling cost in a coarse space and by developing local reconstruction strategies from coarse-grained dynamics. Microscopic reconstruction corrects possibly significant errors introduced through coarse-graining, leading to the controlled-error approximation of the sampled stochastic process. In this manner, the proposed multi-level algorithm overcomes known shortcomings of coarse-graining of particle systems with complex interactions such as combined long and short-range particle interactions and/or complex lattice geometries. Specifically, we provide error analysis for the approximation of long-time stationary dynamics in terms of relative entropy and prove that information loss in the multi-level methods is growing linearly in time, which in turn implies that an appropriate observable in the stationary regime is the information loss of the path measures per unit time. We show that this observable can be either estimated a priori, or it can be tracked computationally a posteriori in the course of a simulation. The stationary regime is of critical importance to molecular simulations as it is relevant to long-time sampling, obtaining phase diagrams and in studying metastability properties of high-dimensional complex systems. Finally, the multi-level nature of the method provides flexibility in combining rejection-free and null-event implementations, generating a hierarchy of algorithms with an adjustable number of rejections that includes well-known rejection-free and null-event algorithms.

NAJun 18, 2010

Coupled coarse graining and Markov Chain Monte Carlo for lattice systems

Evangelia Kalligiannaki, Markos A. Katsoulakis, Petr Plechac

We propose an efficient Markov Chain Monte Carlo method for sampling equilibrium distributions for stochastic lattice models, capable of handling correctly long and short-range particle interactions. The proposed method is a Metropolis-type algorithm with the proposal probability transition matrix based on the coarse-grained approximating measures introduced in a series of works of M. Katsoulakis, A. Majda, D. Vlachos and P. Plechac, L. Rey-Bellet and D.Tsagkarogiannis,. We prove that the proposed algorithm reduces the computational cost due to energy differences and has comparable mixing properties with the classical microscopic Metropolis algorithm, controlled by the level of coarsening and reconstruction procedure. The properties and effectiveness of the algorithm are demonstrated with an exactly solvable example of a one dimensional Ising-type model, comparing efficiency of the single spin-flip Metropolis dynamics and the proposed coupled Metropolis algorithm.

MLJul 16, 2024

Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows

Hyemin Gu, Markos A. Katsoulakis, Luc Rey-Bellet et al.

We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG), a system of a backward Hamilton-Jacobi (HJ) and a forward continuity partial differential equations (PDEs) whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner even for learning high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures.

MLOct 2, 2024

Equivariant score-based generative models provably learn distributions with symmetries efficiently

Ziyu Chen, Markos A. Katsoulakis, Benjamin J. Zhang

Symmetry is ubiquitous in many real-world phenomena and tasks, such as physics, images, and molecular simulations. Empirical studies have demonstrated that incorporating symmetries into generative models can provide better generalization and sampling efficiency when the underlying data distribution has group symmetry. In this work, we provide the first theoretical analysis and guarantees of score-based generative models (SGMs) for learning distributions that are invariant with respect to some group symmetry and offer the first quantitative comparison between data augmentation and adding equivariant inductive bias. First, building on recent works on the Wasserstein-1 ($\mathbf{d}_1$) guarantees of SGMs and empirical estimations of probability divergences under group symmetry, we provide an improved $\mathbf{d}_1$ generalization bound when the data distribution is group-invariant. Second, we describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman theory, and rigorously demonstrate that one can learn the score of a symmetrized distribution using equivariant vector fields without data augmentations through the analysis of the optimality and equivalence of score-matching objectives. This also provides practical guidance that one does not have to augment the dataset as long as the vector field or the neural network parametrization is equivariant. Moreover, we quantify the impact of not incorporating equivariant structure into the score parametrization, by showing that non-equivariant vector fields can yield worse generalization bounds. This can be viewed as a type of model-form error that describes the missing structure of non-equivariant vector fields. Numerical simulations corroborate our analysis and highlight that data augmentations cannot replace the role of equivariant vector fields.

MLOct 31, 2022

Lipschitz-regularized gradient flows and generative particle algorithms for high-dimensional scarce data

Hyemin Gu, Panagiota Birmpa, Yannis Pantazis et al.

We build a new class of generative algorithms capable of efficiently learning an arbitrary target distribution from possibly scarce, high-dimensional data and subsequently generate new samples. These generative algorithms are particle-based and are constructed as gradient flows of Lipschitz-regularized Kullback-Leibler or other $f$-divergences, where data from a source distribution can be stably transported as particles, towards the vicinity of the target distribution. As a highlighted result in data integration, we demonstrate that the proposed algorithms correctly transport gene expression data points with dimension exceeding 54K, while the sample size is typically only in the hundreds.

NAOct 17, 2016

Information Criteria for quantifying loss of reversibility in parallelized KMC

Konstantinos Gourgoulias, Markos A. Katsoulakis, Luc Rey-Bellet

Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot be computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.

PRFeb 18, 2015

Pathwise Sensitivity Analysis in Transient Regimes

Georgios Arampatzis, Markos A. Katsoulakis, Yannis Pantazis

The instantaneous relative entropy (IRE) and the corresponding instanta- neous Fisher information matrix (IFIM) for transient stochastic processes are pre- sented in this paper. These novel tools for sensitivity analysis of stochastic models serve as an extension of the well known relative entropy rate (RER) and the corre- sponding Fisher information matrix (FIM) that apply to stationary processes. Three cases are studied here, discrete-time Markov chains, continuous-time Markov chains and stochastic differential equations. A biological reaction network is presented as a demonstration numerical example.

NAOct 4, 2016

Information metrics for long-time errors in splitting schemes for stochastic dynamics and parallel KMC

Konstantinos Gourgoulias, Markos A. Katsoulakis, Luc Rey-Bellet

We propose an information-theoretic approach to analyze the long-time behavior of numerical splitting schemes for stochastic dynamics, focusing primarily on Parallel Kinetic Monte Carlo (KMC) algorithms.Established methods for numerical operator splittings provide error estimates in finite-time regimes, in terms of the order of the local error and the associated commutator. Path-space information-theoretic tools such as the relative entropy rate (RER) allow us to control long-time error through commutator calculations. Furthermore, they give rise to an a posteriori representation of the error which can thus be tracked in the course of a simulation. Another outcome of our analysis is the derivation of a path-space information criterion for comparison (and possibly design) of numerical schemes, in analogy to classical information criteria for model selection and discrimination. In the context of Parallel KMC, our analysis allows us to select schemes with improved numerical error and more efficient processor communication. We expect that such a path-space information perspective on numerical methods will be broadly applicable in stochastic dynamics, both for the finite and the long-time regime.

77.8MLMay 12Code

ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks

Zhizhen Zhang, Hyemin Gu, Benjamin J. Zhang et al.

Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with fully interpretable, user-configurable parameters and modular topology, demand process, and control rules. The simulator advances a directed routing graph in discrete time: demand arrives at the destination, is served from stock or recorded as backlog, and triggers replenishment through the network. The state vector tracks per-node on-hand inventory with outstanding orders, in-transit shipments, and a smoothed demand estimate, so the dynamics close as a Markov chain on a tractable state space whose transition kernel acts linearly on the empirical distribution of the state. The released data reproduces the bullwhip effect at empirically consistent magnitudes, and three conservation laws encoded in the Markov chain serve as verification tools when users extend the simulator. We release datasets at two catalogue scales ($C=50$ and $C=200$) with six scenario sweeps producing 30 additional rollouts and 20 Latin-hypercube perturbations, exhibiting dynamics absent from fixed TSF benchmarks: variance amplification, cascading bottlenecks, regime shifts, and cross-channel coupling through shared macro shocks. Zero-shot evaluation of four foundation models (Chronos, Moirai, TimesFM, Lag-Llama) shows MASE values exceeding public GIFT-Eval references at low-to-moderate horizons, supporting incorporation into existing benchmarks. The same pairing produces forecast confidence bands via Latin-hypercube perturbation of demand-side knobs, forward UQ from parameter uncertainty unavailable on standard TSF datasets, demonstrating that foundation models can serve as fast surrogates for the digital twin's forward UQ. Code (MIT): https://github.com/tuhinsahai/ISOMORPH.

95.8LGMay 17

Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space

Kelvin Kan, Xingjian Li, Benjamin J. Zhang et al.

Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and is compatible with time-inhomogeneous schedules. Four novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes $S$-dependence under uniform transitions, and a score-marginal cancellation technique that removes $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models.

MLFeb 9, 2024

Wasserstein proximal operators describe score-based generative models and resolve memorization

Benjamin J. Zhang, Siting Liu, Wuchen Li et al.

We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form of a pair of coupled partial differential equations: a forward-controlled Fokker-Planck (FP) equation, and a backward Hamilton-Jacobi-Bellman (HJB) equation. Via a Cole-Hopf transformation and taking advantage of the fact that the cross-entropy can be related to a linear functional of the density, we show that the HJB equation is an uncontrolled FP equation. Second, with the mathematical structure at hand, we present an interpretable kernel-based model for the score function which dramatically improves the performance of SGMs in terms of training samples and training time. In addition, the WPO-informed kernel model is explicitly constructed to avoid the recently studied memorization effects of score-based generative models. The mathematical form of the new kernel-based models in combination with the use of the terminal condition of the MFG reveals new explanations for the manifold learning and generalization properties of SGMs, and provides a resolution to their memorization effects. Finally, our mathematically informed, interpretable kernel-based model suggests new scalable bespoke neural network architectures for high-dimensional applications.

MLMay 24, 2024

Score-based generative models are provably robust: an uncertainty quantification perspective

Nikiforos Mimikos-Stamatopoulos, Benjamin J. Zhang, Markos A. Katsoulakis

Through an uncertainty quantification (UQ) perspective, we show that score-based generative models (SGMs) are provably robust to the multiple sources of error in practical implementation. Our primary tool is the Wasserstein uncertainty propagation (WUP) theorem, a model-form UQ bound that describes how the $L^2$ error from learning the score function propagates to a Wasserstein-1 ($\mathbf{d}_1$) ball around the true data distribution under the evolution of the Fokker-Planck equation. We show how errors due to (a) finite sample approximation, (b) early stopping, (c) score-matching objective choice, (d) score function parametrization expressiveness, and (e) reference distribution choice, impact the quality of the generative model in terms of a $\mathbf{d}_1$ bound of computable quantities. The WUP theorem relies on Bernstein estimates for Hamilton-Jacobi-Bellman partial differential equations (PDE) and the regularizing properties of diffusion processes. Specifically, PDE regularity theory shows that stochasticity is the key mechanism ensuring SGM algorithms are provably robust. The WUP theorem applies to integral probability metrics beyond $\mathbf{d}_1$, such as the total variation distance and the maximum mean discrepancy. Sample complexity and generalization bounds in $\mathbf{d}_1$ follow directly from the WUP theorem. Our approach requires minimal assumptions, is agnostic to the manifold hypothesis and avoids absolute continuity assumptions for the target distribution. Additionally, our results clarify the trade-offs among multiple error sources in SGMs.

LGOct 10, 2025

Stability of Transformers under Layer Normalization

Kelvin Kan, Xingjian Li, Benjamin J. Zhang et al.

Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.

MLSep 5, 2025

Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations

Benjamin J. Zhang, Siting Liu, Stanley J. Osher et al.

In-context operator networks (ICON) are a class of operator learning methods based on the novel architectures of foundation models. Trained on a diverse set of datasets of initial and boundary conditions paired with corresponding solutions to ordinary and partial differential equations (ODEs and PDEs), ICON learns to map example condition-solution pairs of a given differential equation to an approximation of its solution operator. Here, we present a probabilistic framework that reveals ICON as implicitly performing Bayesian inference, where it computes the mean of the posterior predictive distribution over solution operators conditioned on the provided context, i.e., example condition-solution pairs. The formalism of random differential equations provides the probabilistic framework for describing the tasks ICON accomplishes while also providing a basis for understanding other multi-operator learning methods. This probabilistic perspective provides a basis for extending ICON to \emph{generative} settings, where one can sample from the posterior predictive distribution of solution operators. The generative formulation of ICON (GenICON) captures the underlying uncertainty in the solution operator, which enables principled uncertainty quantification in the solution predictions in operator learning.

MLMay 24, 2024

Nonlinear denoising score matching for enhanced learning of structured distributions

Jeremiah Birrell, Markos A. Katsoulakis, Luc Rey-Bellet et al.

We present a novel method for training score-based generative models which uses nonlinear noising dynamics to improve learning of structured distributions. Generalizing to a nonlinear drift allows for additional structure to be incorporated into the dynamics, thus making the training better adapted to the data, e.g., in the case of multimodality or (approximate) symmetries. Such structure can be obtained from the data by an inexpensive preprocessing step. The nonlinear dynamics introduces new challenges into training which we address in two ways: 1) we develop a new nonlinear denoising score matching (NDSM) method, 2) we introduce neural control variates in order to reduce the variance of the NDSM training objective. We demonstrate the effectiveness of this method on several examples: a) a collection of low-dimensional examples, motivated by clustering in latent space, b) high-dimensional images, addressing issues with mode imbalance, small training sets, and approximate symmetries, the latter being a challenge for methods based on equivariant neural networks, which require exact symmetries, c) latent space representation of high-dimensional data, demonstrating improved performance with greatly reduced computational cost. Our method learns score-based generative models with less data by flexibly incorporating structure arising in the dataset.

MLMay 22, 2024

Robust Generative Learning with Lipschitz-Regularized $α$-Divergences Allows Minimal Assumptions on Target Distributions

Ziyu Chen, Hyemin Gu, Markos A. Katsoulakis et al.

This paper demonstrates the robustness of Lipschitz-regularized $α$-divergences as objective functionals in generative modeling, showing they enable stable learning across a wide range of target distributions with minimal assumptions. We establish that these divergences remain finite under a mild condition-that the source distribution has a finite first moment-regardless of the properties of the target distribution, making them adaptable to the structure of target distributions. Furthermore, we prove the existence and finiteness of their variational derivatives, which are essential for stable training of generative models such as GANs and gradient flows. For heavy-tailed targets, we derive necessary and sufficient conditions that connect data dimension, $α$, and tail behavior to divergence finiteness, that also provide insights into the selection of suitable $α$'s. We also provide the first sample complexity bounds for empirical estimations of these divergences on unbounded domains. As a byproduct, we obtain the first sample complexity bounds for empirical estimations of these divergences and the Wasserstein-1 metric with group symmetry on unbounded domains. Numerical experiments confirm that generative models leveraging Lipschitz-regularized $α$-divergences can stably learn distributions in various challenging scenarios, including those with heavy tails or complex, low-dimensional, or fractal support, all without any prior knowledge of the structure of target distributions.

MLMay 22, 2023

Statistical Guarantees of Group-Invariant GANs

Ziyu Chen, Markos A. Katsoulakis, Luc Rey-Bellet et al.

This work presents the first statistical performance guarantees for group-invariant generative models. Many real data, such as images and molecules, are invariant to certain group symmetries, which can be taken advantage of to learn more efficiently as we rigorously demonstrate in this work. Here we specifically study generative adversarial networks (GANs), and quantify the gains when incorporating symmetries into the model. Group-invariant GANs are a type of GANs in which the generators and discriminators are hardwired with group symmetries. Empirical studies have shown that these networks are capable of learning group-invariant distributions with significantly improved data efficiency. In this study, we aim to rigorously quantify this improvement by analyzing the reduction in sample complexity and in the discriminator approximation error for group-invariant GANs. Our findings indicate that when learning group-invariant distributions, the number of samples required for group-invariant GANs decreases proportionally by a factor of the group size and the discriminator approximation error has a reduced lower bound. Importantly, the overall error reduction cannot be achieved merely through data augmentation on the training data. Numerical results substantiate our theory and highlight the stark contrast between learning with group-invariant GANs and using data augmentation. This work also sheds light on the study of other generative models with group symmetries, such as score-based generative models.

LGFeb 2, 2022

Structure-preserving GANs

Jeremiah Birrell, Markos A. Katsoulakis, Luc Rey-Bellet et al.

Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fréchet Inception Distance -- especially in the small data regime.

MLJul 17, 2021

Model Uncertainty and Correctability for Directed Graphical Models

Panagiota Birmpa, Jinchao Feng, Markos A. Katsoulakis et al.

Probabilistic graphical models are a fundamental tool in probabilistic modeling, machine learning and artificial intelligence. They allow us to integrate in a natural way expert knowledge, physical modeling, heterogeneous and correlated data and quantities of interest. For exactly this reason, multiple sources of model uncertainty are inherent within the modular structure of the graphical model. In this paper we develop information-theoretic, robust uncertainty quantification methods and non-parametric stress tests for directed graphical models to assess the effect and the propagation through the graph of multi-sourced model uncertainties to quantities of interest. These methods allow us to rank the different sources of uncertainty and correct the graphical model by targeting its most impactful components with respect to the quantities of interest. Thus, from a machine learning perspective, we provide a mathematically rigorous approach to correctability that guarantees a systematic selection for improvement of components of a graphical model while controlling potential new errors created in the process in other parts of the model. We demonstrate our methods in two physico-chemical examples, namely quantum scale-informed chemical kinetics and materials screening to improve the efficiency of fuel cells.

MLNov 11, 2020

$(f,Γ)$-Divergences: Interpolating between $f$-Divergences and Integral Probability Metrics

Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis et al.

We develop a rigorous and general framework for constructing information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs), such as the $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,Γ)$-divergences, provide a notion of `distance' between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport process. The $(f,Γ)$-divergences inherit features from IPMs, such as the ability to compare distributions which are not absolutely continuous, as well as from $f$-divergences, namely the strict concavity of their variational representations and the ability to control heavy-tailed distributions for particular choices of $f$. When combined, these features establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation.

LGSep 7, 2020

Mutual Information for Explainable Deep Learning of Multiscale Systems

Søren Taverniers, Eric J. Hall, Markos A. Katsoulakis et al.

Timely completion of design cycles for complex systems ranging from consumer electronics to hypersonic vehicles relies on rapid simulation-based prototyping. The latter typically involves high-dimensional spaces of possibly correlated control variables (CVs) and quantities of interest (QoIs) with non-Gaussian and possibly multimodal distributions. We develop a model-agnostic, moment-independent global sensitivity analysis (GSA) that relies on differential mutual information to rank the effects of CVs on QoIs. The data requirements of this information-theoretic approach to GSA are met by replacing computationally intensive components of the physics-based model with a deep neural network surrogate. Subsequently, the GSA is used to explain the network predictions, and the surrogate is deployed to close design loops. Viewed as an uncertainty quantification method for interrogating the surrogate, this framework is compatible with a wide variety of black-box models. We demonstrate that the surrogate-driven mutual information GSA provides useful and distinguishable rankings on two applications of interest in energy storage. Consequently, our information-theoretic GSA provides an "outer loop" for accelerated product design by identifying the most and least sensitive input directions and performing subsequent optimization over appropriately reduced parameter subspaces.

MLAug 31, 2020

Uncertainty quantification for Markov Random Fields

Panagiota Birmpa, Markos A. Katsoulakis

We present an information-based uncertainty quantification method for general Markov Random Fields. Markov Random Fields (MRF) are structured, probabilistic graphical models over undirected graphs, and provide a fundamental unifying modeling tool for statistical mechanics, probabilistic machine learning, and artificial intelligence. Typically MRFs are complex and high-dimensional with nodes and edges (connections) built in a modular fashion from simpler, low-dimensional probabilistic models and their local connections; in turn, this modularity allows to incorporate available data to MRFs and efficiently simulate them by leveraging their graph-theoretic structure. Learning graphical models from data and/or constructing them from physical modeling and constraints necessarily involves uncertainties inherited from data, modeling choices, or numerical approximations. These uncertainties in the MRF can be manifested either in the graph structure or the probability distribution functions, and necessarily will propagate in predictions for quantities of interest. Here we quantify such uncertainties using tight, information based bounds on the predictions of quantities of interest; these bounds take advantage of the graphical structure of MRFs and are capable of handling the inherent high-dimensionality of such graphical models. We demonstrate our methods in MRFs for medical diagnostics and statistical mechanics models. In the latter, we develop uncertainty quantification bounds for finite size effects and phase diagrams, which constitute two of the typical predictions goals of statistical mechanics modeling.

MLJul 7, 2020

Variational Representations and Neural Network Estimation of Rényi Divergences

Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis et al.

We derive a new variational formula for the Rényi family of divergences, $R_α(Q\|P)$, between probability measures $Q$ and $P$. Our result generalizes the classical Donsker-Varadhan variational formula for the Kullback-Leibler divergence. We further show that this Rényi variational formula holds over a range of function spaces; this leads to a formula for the optimizer under very weak assumptions and is also key in our development of a consistency theory for Rényi divergence estimators. By applying this theory to neural-network estimators, we show that if a neural network family satisfies one of several strengthened versions of the universal approximation property then the corresponding Rényi divergence estimator is consistent. In contrast to density-estimator based methods, our estimators involve only expectations under $Q$ and $P$ and hence are more effective in high dimensional systems. We illustrate this via several numerical examples of neural network estimation in systems of up to 5000 dimensions.

COMP-PHJun 26, 2020

GINNs: Graph-Informed Neural Networks for Multiscale Physics

Eric J. Hall, Søren Taverniers, Markos A. Katsoulakis et al.

We introduce the concept of a Graph-Informed Neural Network (GINN), a hybrid approach combining deep learning with probabilistic graphical models (PGMs) that acts as a surrogate for physics-based representations of multiscale and multiphysics systems. GINNs address the twin challenges of removing intrinsic computational bottlenecks in physics-based models and generating large data sets for estimating probability distributions of quantities of interest (QoIs) with a high degree of confidence. Both the selection of the complex physics learned by the NN and its supervised learning/prediction are informed by the PGM, which includes the formulation of structured priors for tunable control variables (CVs) to account for their mutual correlations and ensure physically sound CV and QoI distributions. GINNs accelerate the prediction of QoIs essential for simulation-based decision-making where generating sufficient sample data using physics-based models alone is often prohibitively expensive. Using a real-world application grounded in supercapacitor-based energy storage, we describe the construction of GINNs from a Bayesian network-embedded homogenized model for supercapacitor dynamics, and demonstrate their ability to produce kernel density estimates of relevant non-Gaussian, skewed QoIs with tight confidence intervals.

LGJun 15, 2020

Optimizing Variational Representations of Divergences and Accelerating their Statistical Estimation

Jeremiah Birrell, Markos A. Katsoulakis, Yannis Pantazis

Variational representations of divergences and distances between high-dimensional probability distributions offer significant theoretical insights and practical advantages in numerous research areas. Recently, they have gained popularity in machine learning as a tractable and scalable approach for training probabilistic models and for statistically differentiating between data distributions. Their advantages include: 1) They can be estimated from data as statistical averages. 2) Such representations can leverage the ability of neural networks to efficiently approximate optimal solutions in function spaces. However, a systematic and practical approach to improving the tightness of such variational formulas, and accordingly accelerate statistical learning and estimation from data, is currently lacking. Here we develop such a methodology for building new, tighter variational representations of divergences. Our approach relies on improved objective functionals constructed via an auxiliary optimization problem. Furthermore, the calculation of the functional Hessian of objective functionals unveils the local curvature differences around the common optimal variational solution; this quantifies and orders the tightness gains between different variational representations. Finally, numerical simulations utilizing neural network optimization demonstrate that tighter representations can result in significantly faster learning and more accurate estimation of divergences in both synthetic and real datasets (of more than 1000 dimensions), often accelerated by nearly an order of magnitude.

NAJan 6, 2019

Causality and Bayesian network PDEs for multiscale representations of porous media

Kimoon Um, Eric Joseph Hall, Markos A. Katsoulakis et al.

Microscopic (pore-scale) properties of porous media affect and often determine their macroscopic (continuum- or Darcy-scale) counterparts. Understanding the relationship between processes on these two scales is essential to both the derivation of macroscopic models of, e.g., transport phenomena in natural porous media, and the design of novel materials, e.g., for energy storage. Most microscopic properties exhibit complex statistical correlations and geometric constraints, which presents challenges for the estimation of macroscopic quantities of interest (QoIs), e.g., in the context of global sensitivity analysis (GSA) of macroscopic QoIs with respect to microscopic material properties. We present a systematic way of building correlations into stochastic multiscale models through Bayesian networks. This allows us to construct the joint probability density function (PDF) of model parameters through causal relationships that emulate engineering processes, e.g., the design of hierarchical nanoporous materials. Such PDFs also serve as input for the forward propagation of parametric uncertainty; our findings indicate that the inclusion of causal relationships impacts predictions of macroscopic QoIs. To assess the impact of correlations and causal relationships between microscopic parameters on macroscopic material properties, we use a moment-independent GSA based on the differential mutual information. Our GSA accounts for the correlated inputs and complex non-Gaussian QoIs. The global sensitivity indices are used to rank the effect of uncertainty in microscopic parameters on macroscopic QoIs, to quantify the impact of causality on the multiscale model's predictions, and to provide physical interpretations of these results for hierarchical nanoporous materials.

PRSep 16, 2018

Robust information divergences for model-form uncertainty arising from sparse data in random PDE

Eric Joseph Hall, Markos A. Katsoulakis

We develop a novel application of hybrid information divergences to analyze uncertainty in steady-state subsurface flow problems. These hybrid information divergences are non-intrusive, goal-oriented uncertainty quantification tools that enable robust, data-informed predictions in support of critical decision tasks such as regulatory assessment and risk management. We study the propagation of model-form or epistemic uncertainty with numerical experiments that demonstrate uncertainty quantification bounds for (i) parametric sensitivity analysis and (ii) model misspecification due to sparse data. Further, we make connections between the hybrid information divergences and certain concentration inequalities that can be leveraged for efficient computing and account for any available data through suitable statistical quantities.

NASep 9, 2016

Uncertainty quantification for generalized Langevin dynamics

Eric Joseph Hall, Markos A. Katsoulakis, Luc Rey-Bellet

We present efficient finite difference estimators for goal-oriented sensitivity indices with applications to the generalized Langevin equation (GLE). In particular, we apply these estimators to analyze an extended variable formulation of the GLE where other well known sensitivity analysis techniques such as the likelihood ratio method are not applicable to key parameters of interest. These easily implemented estimators are formed by coupling the nominal and perturbed dynamics appearing in the finite difference through a common driving noise, or common random path. After developing a general framework for variance reduction via coupling, we demonstrate the optimality of the common random path coupling in the sense that it produces a minimal variance surrogate for the difference estimator relative to sampling dynamics driven by independent paths. In order to build intuition for the common random path coupling, we evaluate the efficiency of the proposed estimators for a comprehensive set of examples of interest in particle dynamics. These reduced variance difference estimators are also a useful tool for performing global sensitivity analysis and for investigating non-local perturbations of parameters, such as increasing the number of Prony modes active in an extended variable GLE.

NAApr 8, 2015

The geometry of generalized force matching in coarse-graining and related information metrics

Evangelia Kalligiannaki, Vagelis Harmandaris, Markos A. Katsoulakis et al.

Using the probabilistic language of conditional expectations we reformulate the force matching method for coarse-graining of molecular systems as a projection on spaces of coarse observables. A practical outcome of this probabilistic description is the link of the force matching method with thermodynamic integration. This connection provides a way to systematically construct a local mean force in order to optimally approximate the potential of mean force through force matching. We introduce a generalized force matching condition for the local mean force in the sense that allows the approximation of the potential of mean force under both linear and non-linear coarse graining mappings (e.g., reaction coordinates, end-to-end length of chains). Furthermore, we study the equivalence of force matching with relative entropy minimization which we derive for general non-linear coarse graining maps. We present in detail the generalized force matching condition through applications to specific examples in molecular systems.

NAAug 1, 2006

Coarse-graining schemes and a posteriori error estimates for stochastic lattice systems

Markos A. Katsoulakis, Petr Plechac, Luc Rey-Bellet et al.

The primary objective of this work is to develop coarse-graining schemes for stochastic many-body microscopic models and quantify their effectiveness in terms of a priori and a posteriori error analysis. In this paper we focus on stochastic lattice systems of interacting particles at equilibrium. %such as Ising-type models. The proposed algorithms are derived from an initial coarse-grained approximation that is directly computable by Monte Carlo simulations, and the corresponding numerical error is calculated using the specific relative entropy between the exact and approximate coarse-grained equilibrium measures. Subsequently we carry out a cluster expansion around this first-and often inadequate-approximation and obtain more accurate coarse-graining schemes. The cluster expansions yield also sharp a posteriori error estimates for the coarse-grained approximations that can be used for the construction of adaptive coarse-graining methods. We present a number of numerical examples that demonstrate that the coarse-graining schemes developed here allow for accurate predictions of critical behavior and hysteresis in systems with intermediate and long-range interactions. We also present examples where they substantially improve predictions of earlier coarse-graining schemes for short-range interactions.