SYJan 14, 2017
Markov-Nash Equilibria in Mean-Field Games with Discounted CostNaci Saldi, Tamer Başar, Maxim Raginsky
In this paper, we consider discrete-time dynamic games of the mean-field type with a finite number $N$ of agents subject to an infinite-horizon discounted-cost optimality criterion. The state space of each agent is a locally compact Polish space. At each time, the agents are coupled through the empirical distribution of their states, which affects both the agents' individual costs and their state transition probabilities. We introduce a new solution concept of the Markov-Nash equilibrium, under which a policy is player-by-player optimal in the class of all Markov policies. Under mild assumptions, we demonstrate the existence of a mean-field equilibrium in the infinite-population limit $N \to \infty$, and then show that the policy obtained from the mean-field equilibrium is approximately Markov-Nash when the number of agents $N$ is sufficiently large.
SYJun 5, 2018
Approximate Nash Equilibria in Partially Observed Stochastic Games with Mean-Field InteractionsNaci Saldi, Tamer Basar, Maxim Raginsky
Establishing the existence of Nash equilibria for partially observed stochastic dynamic games is known to be quite challenging, with the difficulties stemming from the noisy nature of the measurements available to individual players (agents) and the decentralized nature of this information. When the number of players is sufficiently large and the interactions among agents is of the mean-field type, one way to overcome this challenge is to investigate the infinite-population limit of the problem, which leads to a mean-field game. In this paper, we consider discrete-time partially observed mean-field games with infinite-horizon discounted cost criteria. Using the technique of converting the original partially observed stochastic control problem to a fully observed one on the belief space and the dynamic programming principle, we establish the existence of Nash equilibria for these game models under very mild technical conditions. Then, we show that the mean-field equilibrium policy, when adopted by each agent, forms an approximate Nash equilibrium for games with sufficiently many agents.
ITSep 9, 2011
Information-based complexity, feedback and dynamics in convex programmingMaxim Raginsky, Alexander Rakhlin
We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an {\em oracle} for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a {\em controlled} manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of information in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.
OCMar 16, 2023
Variational Principles for Mirror Descent and Mirror Langevin DynamicsBelinda Tzen, Anant Raj, Maxim Raginsky et al.
Mirror descent, introduced by Nemirovski and Yudin in the 1970s, is a primal-dual convex optimization method that can be tailored to the geometry of the optimization problem at hand through the choice of a strongly convex potential function. It arises as a basic primitive in a variety of applications, including large-scale optimization, machine learning, and control. This paper proposes a variational formulation of mirror descent and of its stochastic variant, mirror Langevin dynamics. The main idea, inspired by the classic work of Brezis and Ekeland on variational principles for gradient flows, is to show that mirror descent emerges as a closed-loop solution for a certain optimal control problem, and the Bellman value function is given by the Bregman divergence between the initial condition and the global minimizer of the objective function.
LGSep 8, 2023
Generalization Bounds: Perspectives from Information Theory and PAC-BayesFredrik Hellström, Giuseppe Durisi, Benjamin Guedj et al.
A fundamental question in theoretical machine learning is generalization. Over the past decades, the PAC-Bayesian approach has been established as a flexible framework to address the generalization capabilities of machine learning algorithms, and design new ones. Recently, it has garnered increased interest due to its potential applicability for a variety of learning algorithms, including deep neural networks. In parallel, an information-theoretic view of generalization has developed, wherein the relation between generalization and various information measures has been established. This framework is intimately connected to the PAC-Bayesian approach, and a number of results have been independently discovered in both strands. In this monograph, we highlight this strong connection and present a unified treatment of PAC-Bayesian and information-theoretic generalization bounds. We present techniques and results that the two perspectives have in common, and discuss the approaches and interpretations that differ. In particular, we demonstrate how many proofs in the area share a modular structure, through which the underlying ideas can be intuited. We pay special attention to the conditional mutual information (CMI) framework; analytical studies of the information complexity of learning algorithms; and the application of the proposed methods to deep learning. This monograph is intended to provide a comprehensive introduction to information-theoretic generalization bounds and their connection to PAC-Bayes, serving as a foundation from which the most recent developments are accessible. It is aimed broadly towards researchers with an interest in generalization and theoretical machine learning.
ITOct 4, 2011
Directed information and Pearl's causal calculusMaxim Raginsky
Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based "causal factorizations" of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl's formalism (and in particular his notion of "intervention") and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl's "back-door" criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.
LGApr 3, 2022
Fitting an immersed submanifold to data via Sussmann's orbit theoremJoshua Hanson, Maxim Raginsky
This paper describes an approach for fitting an immersed submanifold of a finite-dimensional Euclidean space to random samples. The reconstruction mapping from the ambient space to the desired submanifold is implemented as a composition of an encoder that maps each point to a tuple of (positive or negative) times and a decoder given by a composition of flows along finitely many vector fields starting from a fixed initial point. The encoder supplies the times for the flows. The encoder-decoder map is obtained by empirical risk minimization, and a high-probability bound is given on the excess risk relative to the minimum expected reconstruction error over a given class of encoder-decoder maps. The proposed approach makes fundamental use of Sussmann's orbit theorem, which guarantees that the image of the reconstruction map is indeed contained in an immersed submanifold.
ITJun 11, 2018
Sequential Empirical Coordination Under an Output Entropy ConstraintEhsan Shafieepoorfard, Maxim Raginsky
This paper considers the problem of sequential empirical coordination, where the objective is to achieve a given value of the expected uniform deviation between state-action empirical averages and statistical expectations under a given strategic probability measure, with respect to a given universal Glivenko-Cantelli class of test functions. A communication constraint is imposed on the Shannon entropy of the resulting action sequence. It is shown that the fundamental limit on the output entropy is given by the minimum of the mutual information between the state and the action processes under all strategic measures that have the same marginal state process as the target measure and approximate the target measure to desired accuracy with respect to the underlying Glivenko--Cantelli seminorm. The fundamental limit is shown to be asymptotically achievable by tree-structured codes.
OCDec 1, 2022
Nonlinear controllability and function representation by neural stochastic differential equationsTanya Veeravalli, Maxim Raginsky
There has been a great deal of recent interest in learning and approximation of functions that can be expressed as expectations of a given nonlinearity with respect to its random internal parameters. Examples of such representations include "infinitely wide" neural nets, where the underlying nonlinearity is given by the activation function of an individual neuron. In this paper, we bring this perspective to function representation by neural stochastic differential equations (SDEs). A neural SDE is an Itô diffusion process whose drift and diffusion matrix are elements of some parametric families. We show that the ability of a neural SDE to realize nonlinear functions of its initial condition can be related to the problem of optimally steering a certain deterministic dynamical system between two given points in finite time. This auxiliary system is obtained by formally replacing the Brownian motion in the SDE by a deterministic control input. We derive upper and lower bounds on the minimum control effort needed to accomplish this steering; these bounds may be of independent interest in the context of motion planning and deterministic optimal control.
PRApr 27, 2023
A Chain Rule for the Expected Suprema of Bernoulli ProcessesYifeng Chu, Maxim Raginsky
We obtain an upper bound on the expected supremum of a Bernoulli process indexed by the image of an index set under a uniformly Lipschitz function class in terms of properties of the index set and the function class, extending an earlier result of Maurer for Gaussian processes. The proof makes essential use of recent results of Bednorz and Latala on the boundedness of Bernoulli processes.
SYJun 16, 2020
Non-signaling Approximations of Stochastic Team ProblemsNaci Saldi, Can Deha Karıksız, Maxim Raginsky et al.
In this paper, we consider non-signaling approximation of finite stochastic teams. We first introduce a hierarchy of team decision rules that can be classified in an increasing order as randomized policies, quantum-correlated policies, and non-signaling policies. Then, we establish an approximation of team-optimal policies for sequential teams via extendible non-signaling policies. We prove that the distance between extendible non-signaling policies and decentralized policies is small if the extension is sufficiently large. Using this result, we establish a linear programming (LP) approximation of sequential teams. Finally, we state an open problem regarding computation of optimal value of quantum-correlated policies.
OCJul 1, 2023
A Constructive Approach to Function Realization by Neural Stochastic Differential EquationsTanya Veeravalli, Maxim Raginsky
The problem of function approximation by neural dynamical systems has typically been approached in a top-down manner: Any continuous function can be approximated to an arbitrary accuracy by a sufficiently complex model with a given architecture. This can lead to high-complexity controls which are impractical in applications. In this paper, we take the opposite, constructive approach: We impose various structural restrictions on system dynamics and consequently characterize the class of functions that can be realized by such a system. The systems are implemented as a cascade interconnection of a neural stochastic differential equation (Neural SDE), a deterministic dynamical system, and a readout map. Both probabilistic and geometric (Lie-theoretic) methods are used to characterize the classes of functions realized by such systems.
33.1SYMay 8
A Behavioral Framework for Data-Driven Modeling of Nonlinear Systems in Vector-Valued Reproducing Kernel Hilbert SpacesBoya Hou, Maxim Raginsky
We generalize Jan Willems' behavioral approach to a class of discrete-time nonlinear systems in a vector-valued reproducing kernel Hilbert space (RKHS). Apart from linear time-invariant systems, this class covers nonlinear systems modeled by Volterra series and their autoregressive variants, as well as systems admitting Hammerstein-type state-space realizations. We apply the proposed framework to the problem of data-driven modeling of such systems, i.e., when simulation or control objectives for an unknown system are carried out without an explicit system identification step. To that end, we link the behavioral approach to two data-driven modeling methods in a vector-valued RKHS: (1) minimum-norm interpolation and (2) subspace identification.
73.0ITMay 6
Information-theoretic Limits of Learning and EstimationAbbas El Gamal, Maxim Raginsky
Information theory plays a central role in establishing fundamental limits on what any learning or estimation algorithm can -- and cannot -- achieve, regardless of computational power. In this chapter, we provide an introduction to these connections. End-of-chapter exercises makes the material suitable for both classroom use and self-study. We begin by introducing concentration inequalities along with the notions of covering and packing in metric spaces, and the associated concept of metric entropy. These tools are essential for our analysis. We then introduce the learning-theoretic framework and derive upper bounds on generalization error in terms of metric entropy, Rademacher complexity, and the VC dimension, as well as mutual information and relative entropy. Finally we discuss the minimax estimation framework and establish lower bounds on minimax risk using Fano's inequality, yielding bounds in terms of relative entropy and covering and packing numbers. This manuscript contains preprint of a chapter under consideration for inclusion in the forthcoming third edition of Cover and Thomas's Elements of Information Theory, posted with permission from Wiley. It would follow the chapter posted at arXiv:2605.02989 . The table of contents of the new edition can be found at: https://docs.google.com/document/d/1L-m4oQEJw1PJhoxBeMwrrBD8S_HmvzMEkPbYvS24980/edit?usp=sharing . For feedback, please contact abbas@ee.stanford.edu.
CLJan 23, 2024
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural RecursionDylan Zhang, Curt Tigges, Zory Zhang et al.
This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently excel beyond neural models, such as inferring semantic relations between datatypes and emulating program behavior. We introduce a general framework that nicely connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and learned models' behavior. The framework includes a representation that captures the general \textit{syntax} of structural recursion, coupled with two different frameworks for understanding their \textit{semantics} -- one that is more natural from a programming languages perspective and one that helps bridge that perspective with a mechanistic understanding of the underlying transformer architecture. With our framework as a powerful conceptual tool, we identify different issues under various set-ups. The models trained to emulate recursive computations cannot fully capture the recursion yet instead fit short-cut algorithms and thus cannot solve certain edge cases that are under-represented in the training distribution. In addition, it is difficult for state-of-the-art large language models (LLMs) to mine recursive rules from in-context demonstrations. Meanwhile, these LLMs fail in interesting ways when emulating reduction (step-wise computation) of the recursive function.
87.1LGApr 21
Separating Geometry from Probability in the Analysis of GeneralizationMaxim Raginsky, Benjamin Recht
The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of ``in-sample'' performance) or some entirely new $S'$ (in which case we speak of ``out-of-sample'' performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d.\ draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used \textit{ex post} to characterize the situations when this error term is small (either on average or with high probability).
MLJan 30, 2024
Rademacher Complexity of Neural ODEs via Chen-Fliess SeriesJoshua Hanson, Maxim Raginsky
We show how continuous-depth neural ODE models can be framed as single-layer, infinite-width nets using the Chen--Fliess series expansion for nonlinear ODEs. In this net, the output ``weights'' are taken from the signature of the control input -- a tool used to represent infinite-dimensional paths as a sequence of tensors -- which comprises iterated integrals of the control input over a simplex. The ``features'' are taken to be iterated Lie derivatives of the output function with respect to the vector fields in the controlled ODE model. The main result of this work applies this framework to derive compact expressions for the Rademacher complexity of ODE models that map an initial condition to a scalar output at some terminal time. The result leverages the straightforward analysis afforded by single-layer architectures. We conclude with some examples instantiating the bound for some specific systems and discuss potential follow-up work.
LGApr 13, 2025
Expressivity of Quadratic Neural ODEsJoshua Hanson, Maxim Raginsky
This work focuses on deriving quantitative approximation error bounds for neural ordinary differential equations having at most quadratic nonlinearities in the dynamics. The simple dynamics of this model form demonstrates how expressivity can be derived primarily from iteratively composing many basic elementary operations, versus from the complexity of those elementary operations themselves. Like the analog differential analyzer and universal polynomial DAEs, the expressivity is derived instead primarily from the "depth" of the model. These results contribute to our understanding of what depth specifically imparts to the capabilities of deep learning architectures.
LGMay 24, 2023
Can Transformers Learn to Solve Problems Recursively?Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman et al.
Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the ``shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
LGMay 18, 2023
A unified framework for information-theoretic generalization boundsYifeng Chu, Maxim Raginsky
This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms. The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{ψ_p}$ Orlicz spaces. Using the decorrelation lemma in combination with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, we obtain new upper bounds on the generalization error, both in expectation and in high probability, and recover as special cases many of the existing generalization bounds, including the ones based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities. In addition, the Fernique-Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.
LGFeb 14, 2022
Input-to-State Stable Neural Ordinary Differential Equations with Applications to Transient Modeling of CircuitsAlan Yang, Jie Xiong, Maxim Raginsky et al.
This paper proposes a class of neural ordinary differential equations parametrized by provably input-to-state stable continuous-time recurrent neural networks. The model dynamics are defined by construction to be input-to-state stable (ISS) with respect to an ISS-Lyapunov function that is learned jointly with the dynamics. We use the proposed method to learn cheap-to-simulate behavioral models for electronic circuits that can accurately reproduce the behavior of various digital and analog circuits when simulated by a commercial circuit simulator, even when interconnected with circuit components not encountered during training. We also demonstrate the feasibility of learning ISS-preserving perturbations to the dynamics for modeling degradation effects due to circuit aging.
LGOct 4, 2021
Information-theoretic generalization bounds for black-box learning algorithmsHrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg et al.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
LGDec 29, 2020
Minimum Excess Risk in Bayesian LearningAolin Xu, Maxim Raginsky
We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the discussions we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.
LGNov 18, 2020
Learning Recurrent Neural Net Models of Nonlinear SystemsJoshua Hanson, Maxim Raginsky, Eduardo Sontag
We consider the following learning problem: Given sample pairs of input and output signals generated by an unknown nonlinear system (which is not assumed to be causal or time-invariant), we wish to find a continuous-time recurrent neural net with hyperbolic tangent activation function that approximately reproduces the underlying i/o behavior with high confidence. Leveraging earlier work concerned with matching output derivatives up to a given finite order, we reformulate the learning problem in familiar system-theoretic language and derive quantitative guarantees on the sup-norm risk of the learned model in terms of the number of neurons, the sample size, the number of derivatives being matched, and the regularity properties of the inputs, the outputs, and the unknown i/o map.
LGFeb 5, 2020
Function approximation by neural nets in the mean-field regime: Entropic regularization and controlled McKean-Vlasov dynamicsBelinda Tzen, Maxim Raginsky
We consider the problem of function approximation by two-layer neural nets with random weights that are "nearly Gaussian" in the sense of Kullback-Leibler divergence. Our setting is the mean-field limit, where the finite population of neurons in the hidden layer is replaced by a continuous ensemble. We show that the problem can be phrased as global minimization of a free energy functional on the space of (finite-length) paths over probability measures on the weights. This functional trades off the $L^2$ approximation risk of the terminal measure against the KL divergence of the path with respect to an isotropic Brownian motion prior. We characterize the unique global minimizer and examine the dynamics in the space of probability measures over weights that can achieve it. In particular, we show that the optimal path-space measure corresponds to the Föllmer drift, the solution to a McKean-Vlasov optimal control problem closely related to the classic Schrödinger bridge problem. While the Föllmer drift cannot in general be obtained in closed form, thus limiting its potential algorithmic utility, we illustrate the viability of the mean-field Langevin diffusion as a finite-time approximation under various conditions on entropic regularization. Specifically, we show that it closely tracks the Föllmer drift when the regularization is such that the minimizing density is log-concave.
LGNov 12, 2019
Model-Augmented Estimation of Conditional Mutual Information for Feature SelectionAlan Yang, AmirEmad Ghassami, Maxim Raginsky et al.
Markov blanket feature selection, while theoretically optimal, is generally challenging to implement. This is due to the shortcomings of existing approaches to conditional independence (CI) testing, which tend to struggle either with the curse of dimensionality or computational complexity. We propose a novel two-step approach which facilitates Markov blanket feature selection in high dimensions. First, neural networks are used to map features to low-dimensional representations. In the second step, CI testing is performed by applying the $k$-NN conditional mutual information estimator to the learned feature maps. The mappings are designed to ensure that mapped samples both preserve information and share similar information about the target variable if and only if they are close in Euclidean distance. We show that these properties boost the performance of the $k$-NN estimator in the second step. The performance of the proposed method is evaluated on both synthetic and real data.
LGJun 21, 2019
Universal Approximation of Input-Output Maps by Temporal Convolutional NetsJoshua Hanson, Maxim Raginsky
There has been a recent shift in sequence-to-sequence modeling from recurrent network architectures to convolutional network architectures due to computational advantages in training and operation while still achieving competitive performance. For systems having limited long-term temporal dependencies, the approximation capability of recurrent networks is essentially equivalent to that of temporal convolutional nets (TCNs). We prove that TCNs can approximate a large class of input-output maps having approximately finite memory to arbitrary error tolerance. Furthermore, we derive quantitative approximation rates for deep ReLU TCNs in terms of the width and depth of the network and modulus of continuity of the original input-output map, and apply these results to input-output maps of systems that admit finite-dimensional state-space realizations (i.e., recurrent models).
LGMay 23, 2019
Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion LimitBelinda Tzen, Maxim Raginsky
In deep latent Gaussian models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a parametric nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. This work considers the diffusion limit of such models, where the number of layers tends to infinity, while the step size and the noise variance tend to zero. The limiting latent object is an Itô diffusion process that solves a stochastic differential equation (SDE) whose drift and diffusion coefficient are implemented by neural nets. We develop a variational inference framework for these \textit{neural SDEs} via stochastic automatic differentiation in Wiener space, where the variational approximations to the posterior are obtained by Girsanov (mean-shift) transformation of the standard Wiener process and the computation of gradients is based on the theory of stochastic flows. This permits the use of black-box SDE solvers and automatic differentiation for end-to-end inference. Experimental results with synthetic data are provided.
PRMar 5, 2019
Theoretical guarantees for sampling and inference in generative models with latent diffusionsBelinda Tzen, Maxim Raginsky
We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. We make the following contributions: We provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control. We quantify the expressiveness of diffusion-based generative models. Specifically, we show that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback-Leibler divergence to the target distribution. Finally, we present and analyze a scheme for unbiased simulation of generative models with latent diffusions and provide bounds on the variance of the resulting estimators. This scheme can be implemented as a deep generative model with a random number of layers.
MLDec 23, 2018
Learning finite-dimensional coding schemes with nonlinear reconstruction mapsJaeho Lee, Maxim Raginsky
This paper generalizes the Maurer--Pontil framework of finite-dimensional lossy coding schemes to the setting where a high-dimensional random vector is mapped to an element of a compact set of latent representations in a lower-dimensional Euclidean space, and the reconstruction map belongs to a given class of nonlinear maps. Under this setup, which encompasses a broad class of unsupervised representation learning problems, we establish a connection to approximate generative modeling under structural constraints using the tools from the theory of optimal transportation. Next, we consider problem of learning a coding scheme on the basis of a finite collection of training samples and present generalization bounds that hold with high probability. We then illustrate the general theory in the setting where the reconstruction maps are implemented by deep neural nets.
OCOct 4, 2018
Discrete-time Risk-sensitive Mean-field GamesNaci Saldi, Tamer Basar, Maxim Raginsky
In this paper, we study a class of discrete-time mean-field games under the infinite-horizon risk-sensitive discounted-cost optimality criterion. Risk-sensitivity is introduced for each agent (player) via an exponential utility function. In this game model, each agent is coupled with the rest of the population through the empirical distribution of the states, which affects both the agent's individual cost and its state dynamics. Under mild assumptions, we establish the existence of a mean-field equilibrium in the infinite-population limit as the number of agents ($N$) goes to infinity, and then show that the policy obtained from the mean-field equilibrium constitutes an approximate Nash equilibrium when $N$ is sufficiently large.
LGFeb 18, 2018
Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical MetastabilityBelinda Tzen, Tengyuan Liang, Maxim Raginsky
We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003. For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short recurrence time; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long escape time. We call this phenomenon empirical metastability. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.
LGMay 22, 2017
Minimax Statistical Learning with Wasserstein DistancesJaeho Lee, Maxim Raginsky
As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove generalization bounds that involve the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.
LGMay 22, 2017
Information-theoretic analysis of generalization capability of learning algorithmsAolin Xu, Maxim Raginsky
We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The bounds provide an information-theoretic understanding of generalization in learning problems, and give theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information. We propose a number of methods for this purpose, among which are algorithms that regularize the ERM algorithm with relative entropy or with random noise. Our work extends and leads to nontrivial improvements on the recent results of Russo and Zou.
LGMay 19, 2017
EE-Grad: Exploration and Exploitation for Cost-Efficient Mini-Batch SGDMehmet A. Donmez, Maxim Raginsky, Andrew C. Singer
We present a generic framework for trading off fidelity and cost in computing stochastic gradients when the costs of acquiring stochastic gradients of different quality are not known a priori. We consider a mini-batch oracle that distributes a limited query budget over a number of stochastic gradients and aggregates them to estimate the true gradient. Since the optimal mini-batch size depends on the unknown cost-fidelity function, we propose an algorithm, {\it EE-Grad}, that sequentially explores the performance of mini-batch oracles and exploits the accumulated knowledge to estimate the one achieving the best performance in terms of cost-efficiency. We provide performance guarantees for EE-Grad with respect to the optimal mini-batch oracle, and illustrate these results in the case of strongly convex objectives. We also provide a simple numerical example that corroborates our theoretical findings.
LGFeb 13, 2017
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysisMaxim Raginsky, Alexander Rakhlin, Matus Telgarsky
Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand and Mitter, 1991). The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks. As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.
OCAug 31, 2015
Coordinate Dual Averaging for Decentralized Online Optimization with Nonseparable Global ObjectivesSoomin Lee, Angelia Nedić, Maxim Raginsky
We consider a decentralized online convex optimization problem in a network of agents, where each agent controls only a coordinate (or a part) of the global decision vector. For such a problem, we propose two decentralized variants (ODA-C and ODA-PS) of Nesterov's primal-dual algorithm with dual averaging. In ODA-C, to mitigate the disagreements on the primal-vector updates, the agents implement a generalization of the local information-exchange dynamics recently proposed by Li and Marden over a static undirected graph. In ODA-PS, the agents implement the broadcast-based push-sum dynamics over a time-varying sequence of uniformly connected digraphs. We show that the regret bounds in both cases have sublinear growth of $O(\sqrt{T})$, with the time horizon $T$, when the stepsize is of the form $1/\sqrt{t}$ and the objective functions are Lipschitz-continuous convex functions with Lipschitz gradients. We also implement the proposed algorithms on a sensor network to complement our theoretical analysis.
OCJan 14, 2014
Online Markov decision processes with Kullback-Leibler control costPeng Guan, Maxim Raginsky, Rebecca Willett
This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent's action at each time step is to specify the probability distribution for the next state given the current state. Following the set-up of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent's next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. An explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using noncausal knowledge of the state costs) under mild regularity conditions is presented, along with a demonstration of the performance of the proposed strategy on a simulated target tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.
LGOct 28, 2013
Relax but stay in control: from value to algorithms for online Markov decision processesPeng Guan, Maxim Raginsky, Rebecca Willett
Online learning algorithms are designed to perform in non-stationary environments, but generally there is no notion of a dynamic state to model constraints on current and future actions as a function of past actions. State-based models are common in stochastic control settings, but commonly used frameworks such as Markov Decision Processes (MDPs) assume a known stationary environment. In recent years, there has been a growing interest in combining the above two frameworks and considering an MDP setting in which the cost function is allowed to change arbitrarily after each time step. However, most of the work in this area has been algorithmic: given a problem, one would develop an algorithm almost from scratch. Moreover, the presence of the state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. This paper describes a broad extension of the ideas proposed by Rakhlin et al. to give a general framework for deriving algorithms in an MDP setting with arbitrarily changing costs. This framework leads to a unifying view of existing methods and provides a general procedure for constructing new ones. Several new methods are presented, and one of them is shown to have important advantages over a similar method developed from scratch via an online version of approximate dynamic programming.
OCJul 1, 2013
Online discrete optimization in social networks in the presence of Knightian uncertaintyMaxim Raginsky, Angelia Nedić
We study a model of collective real-time decision-making (or learning) in a social network operating in an uncertain environment, for which no a priori probabilistic model is available. Instead, the environment's impact on the agents in the network is seen through a sequence of cost functions, revealed to the agents in a causal manner only after all the relevant actions are taken. There are two kinds of costs: individual costs incurred by each agent and local-interaction costs incurred by each agent and its neighbors in the social network. Moreover, agents have inertia: each agent has a default mixed strategy that stays fixed regardless of the state of the environment, and must expend effort to deviate from this strategy in order to respond to cost signals coming from the environment. We construct a decentralized strategy, wherein each agent selects its action based only on the costs directly affecting it and on the decisions made by its neighbors in the network. In this setting, we quantify social learning in terms of regret, which is given by the difference between the realized network performance over a given time horizon and the best performance that could have been achieved in hindsight by a fictitious centralized entity with full knowledge of the environment's evolution. We show that our strategy achieves the regret that scales polylogarithmically with the time horizon and polynomially with the number of agents and the maximum number of neighbors of any agent in the social network.