Andrea Montanari

ML
h-index27
74papers
9,350citations
Novelty54%
AI Score52

74 Papers

SYOct 20, 2011
Distributed Storage for Intermittent Energy Sources: Control Design and Performance Limits

Yashodhan Kanoria, Andrea Montanari, David Tse et al. · stanford

One of the most important challenges in the integration of renewable energy sources into the power grid lies in their `intermittent' nature. The power output of sources like wind and solar varies with time and location due to factors that cannot be controlled by the provider. Two strategies have been proposed to hedge against this variability: 1) use energy storage systems to effectively average the produced power over time; 2) exploit distributed generation to effectively average production over location. We introduce a network model to study the optimal use of storage and transmission resources in the presence of random energy sources. We propose a Linear-Quadratic based methodology to design control strategies, and we show that these strategies are asymptotically optimal for some simple network topologies. For these topologies, the dependence of optimal performance on storage and transmission capacity is explicitly quantified.

STNov 21, 2012
Localization from Incomplete Noisy Distance Measurements

Adel Javanmard, Andrea Montanari

We consider the problem of positioning a cloud of points in the Euclidean space $\mathbb{R}^d$, using noisy measurements of a subset of pairwise distances. This task has applications in various areas, such as sensor network localization and reconstruction of protein conformations from NMR measurements. Also, it is closely related to dimensionality reduction problems and manifold learning, where the goal is to learn the underlying global geometry of a data set using local (or partial) metric information. Here we propose a reconstruction algorithm based on semidefinite programming. For a random geometric graph model and uniformly bounded noise, we provide a precise characterization of the algorithm's performance: In the noiseless case, we find a radius $r_0$ beyond which the algorithm reconstructs the exact positions (up to rigid transformations). In the presence of noise, we obtain upper and lower bounds on the reconstruction error that match up to a factor that depends only on the dimension $d$, and the average degree of the nodes in the graph.

LGFeb 28, 2023
Learning time-scales in two-layers neural networks

Raphaël Berthier, Andrea Montanari, Kangjie Zhou

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn' although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

MLSep 25, 2023
Towards a statistical theory of data selection under weak supervision

Germain Kolossov, Andrea Montanari, Pulkit Tandon

Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume to be given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and to be given access to a `surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, to be denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high- dimensional asymptotics, we show that: $(i)$~Data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$~Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.

MLAug 25, 2023
Six Lectures on Linearized Neural Networks

Theodor Misiakiewicz, Andrea Montanari

In these six lectures, we examine what can be learnt about the behavior of multi-layer neural networks from the analysis of linear models. We first recall the correspondence between neural networks and linear models via the so-called lazy regime. We then review four models for linearized neural networks: linear regression with concentrated features, kernel ridge regression, random feature model and neural tangent model. Finally, we highlight the limitations of the linear theory and discuss how other approaches can overcome them.

MLJun 14, 2022
Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks

Andrea Montanari, Kangjie Zhou

Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large? We consider this question under the null model in which the points are i.i.d. standard Gaussian vectors, focusing on the asymptotic regime in which $n,d\to\infty$, with $n/d\toα\in (0,\infty)$, while $m$ is fixed. Denoting by $\mathscr{F}_{m, α}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m, α}$. In particular, we characterize the Wasserstein radius of $\mathscr{F}_{m,α}$ up to constant multiplicative factors, and determine it exactly for $m=1$. We also prove sharp bounds in terms of Kullback-Leibler divergence and Rényi information dimension. The previous question has application to unsupervised learning methods, such as projection pursuit and independent component analysis. We introduce a version of the same problem that is relevant for supervised learning, and prove a sharp Wasserstein radius bound. As an application, we establish an upper bound on the interpolation threshold of two-layers neural networks with $m$ hidden neurons.

CVJul 25, 2024
Scaling Training Data with Lossy Image Compression

Katherine L. Mentzer, Andrea Montanari

Empirically-determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time. In certain applications, storage space is an important constraint, and data format needs to be chosen carefully as a consequence. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution. In order to capture this trade-off and optimize storage of training data, we propose a `storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image. We prove that this law holds within a stylized model for image compression, and verify it empirically on two computer vision tasks, extracting the relevant parameters. We then show that this law can be used to optimize the lossy compression level. At given storage, models trained on optimally compressed images present a significantly smaller test error with respect to models trained on the original data. Finally, we investigate the potential benefits of randomizing the compression level.

LGMar 21
Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems

Alankrita Bhatt, Mukur Gupta, Germain Kolossov et al.

Generating data from discrete distributions is important for a number of application domains including text, tabular data, and genomic data. Several groups have recently used random $k$-satisfiability ($k$-SAT) as a synthetic benchmark for new generative techniques. In this paper, we show that fundamental insights from the theory of random constraint satisfaction problems have observable implications (sometime contradicting intuition) on the behavior of generative techniques on such benchmarks. More precisely, we study the problem of generating a uniformly random solution of a given (random) $k$-SAT or $k$-XORSAT formula. Among other findings, we observe that: $(i)$~Continuous diffusions outperform masked discrete diffusions; $(ii)$~Learned diffusions can match the theoretical `ideal' accuracy; $(iii)$~Smart ordering of the variables can significantly improve accuracy, although not following popular heuristics.

LGFeb 6, 2024
Scaling laws for learning with real and surrogate data

Ayush Jain, Andrea Montanari, Eren Sasoglu

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data'. We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.

MLFeb 28, 2025
Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Andrea Montanari, Pierfrancesco Urbani

Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of non-equilibrium statistical physics. We show that, for large network width $m$, and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales which implies: $(i)$~The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~A dynamical decoupling between feature learning and overfitting regimes; $(iv)$~A non-monotone behavior of the test error, associated `feature unlearning' regime at large times.

MLFeb 4, 2025
Local minima of the empirical risk in high dimension: General theorems and convex examples

Kiana Asgari, Andrea Montanari, Basil Saeed

We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $\mathbfΘ\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbfΘ^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. In this paper, we apply our characterization to convex losses, where high-dimensional asymptotics were not (in general) rigorously established for $k\ge 2$. We show that our approach is tight and allows to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.

LGFeb 1
Phase Transitions for Feature Learning in Neural Networks

Andrea Montanari, Zihao Wang

According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}_i,y_i)$, where the covariate vectors ${\boldsymbol x}_i\in\mathbb{R}^d$ are isotropic, and responses $y_i$ only depend on ${\boldsymbol x}_i$ through a $k$-dimensional projection ${\boldsymbol Θ}_*^{\sf T}{\boldsymbol x}_i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol Θ}_*$. In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\toδ$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $δ> δ_{\text{alg}}$, for $δ_{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $δ_{\text{alg}}$. Here we derive an analogous threshold $δ_{\text{NN}}$ for two-layer networks. Our characterization of $δ_{\text{NN}}$ opens the way to study the dependence of learning dynamics on the network architecture and training algorithm. The threshold $δ_{\text{NN}}$ is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold $δ_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.

MLMar 11, 2025
Computational bottlenecks for denoising diffusions

Andrea Montanari, Viet Vu

Denoising diffusions sample from a probability distribution $μ$ in $\mathbb{R}^d$ by constructing a stochastic process $({\hat{\boldsymbol x}}_t:t\ge 0)$ in $\mathbb{R}^d$ such that ${\hat{\boldsymbol x}}_0$ is easy to sample, but the distribution of $\hat{\boldsymbol x}_T$ at large $T$ approximates $μ$. The drift ${\boldsymbol m}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ of this diffusion process is learned my minimizing a score-matching objective. Is every probability distribution $μ$, for which sampling is tractable, also amenable to sampling via diffusions? We provide evidence to the contrary by studying a probability distribution $μ$ for which sampling is easy, but the drift of the diffusion process is intractable -- under a popular conjecture on information-computation gaps in statistical estimation. We show that there exist drifts that are superpolynomially close to the optimum value (among polynomial time drifts) and yet yield samples with distribution that is very far from the target one.

LGOct 1, 2025
Train on Validation (ToV): Fast data selection with applications to fine-tuning

Ayush Jain, Andrea Montanari, Eren Sasoglu

State-of-the-art machine learning often follows a two-stage process: $(i)$~pre-training on large, general-purpose datasets; $(ii)$~fine-tuning on task-specific data. In fine-tuning, selecting training examples that closely reflect the target distribution is crucial. However, it is often the case that only a few samples are available from the target distribution. Existing data selection methods treat these target samples as a validation set and estimate the effect of adding or removing a single sample from the training pool by performing inference on the validation set. We propose a simpler and faster alternative that inverts the usual role of train and validation: we perform inference on the training pool before and after fine-tuning on the validation set. We then select samples whose predictions change the most. Our key insight is that the training samples most affected by fine-tuning on a small validation set tend to be the most beneficial for reducing test loss on the target distribution. Experiments on instruction tuning and named entity recognition tasks show that, in most cases, our method achieves lower test log-loss than state-of-the-art approaches. We support our findings with theoretical analysis.

PRJun 5, 2024
Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time?

Andrea Montanari, Kangjie Zhou

Given $d$-dimensional standard Gaussian vectors $\boldsymbol{x}_1,\dots, \boldsymbol{x}_n$, we consider the set of all empirical distributions of its $m$-dimensional projections, for $m$ a fixed constant. Diaconis and Freedman (1984) proved that, if $n/d\to \infty$, all such distributions converge to the standard Gaussian distribution. In contrast, we study the proportional asymptotics, whereby $n,d\to \infty$ with $n/d\to α\in (0, \infty)$. In this case, the projection of the data points along a typical random subspace is again Gaussian, but the set $\mathscr{F}_{m,α}$ of all probability distributions that are asymptotically feasible as $m$-dimensional projections contains non-Gaussian distributions corresponding to exceptional subspaces. Non-rigorous methods from statistical physics yield an indirect characterization of $\mathscr{F}_{m,α}$ in terms of a generalized Parisi formula. Motivated by the goal of putting this formula on a rigorous basis, and to understand whether these projections can be found efficiently, we study the subset $\mathscr{F}^{\rm alg}_{m,α}\subseteq \mathscr{F}_{m,α}$ of distributions that can be realized by a class of iterative algorithms. We prove that this set is characterized by a certain stochastic optimal control problem, and obtain a dual characterization of this problem in terms of a variational principle that extends Parisi's formula. As a byproduct, we obtain computationally achievable values for a class of random optimization problems including `generalized spherical perceptron' models.

LGMay 18, 2023
Sampling, Diffusions, and Stochastic Localization

Andrea Montanari

Diffusions are a successful technique to sample from high-dimensional distributions. The target distribution can be either explicitly given or learnt from a collection of samples. They implement a diffusion process whose endpoint is a sample from the target distribution. The drift of the diffusion process is typically represented as a neural network. Stochastic localization is a successful technique to prove mixing of Markov Chains and other functional inequalities in high dimension. An algorithmic version of stochastic localization was recently proposed in order to sample from certain statistical mechanics models. This expository article has three objectives: $(i)$~Generalize the algorithmic construction to other stochastic localization processes. This construction is both simple and broadly applicable; $(ii)$~Clarify the connection between diffusions and stochastic localization. This allows to derive several known sampling schemes in a unified fashion; $(iii)$~Describe the insights that follow from this unified viewpoint.

LGMar 31, 2022
Adversarial Examples in Random Neural Networks with General Activations

Andrea Montanari, Yuchen Wu

A substantial body of empirical work documents the lack of robustness in deep learning models to adversarial examples. Recent theoretical work proved that adversarial examples are ubiquitous in two-layers networks with sub-exponential width and ReLU or smooth activations, and multi-layer ReLU networks with sub-exponential width. We present a result of the same type, with no restriction on width and for general locally Lipschitz continuous activations. More precisely, given a neural network $f(\,\cdot\,;{\boldsymbol θ})$ with random weights ${\boldsymbol θ}$, and feature vector ${\boldsymbol x}$, we show that an adversarial example ${\boldsymbol x}'$ can be found with high probability along the direction of the gradient $\nabla_{\boldsymbol x}f({\boldsymbol x};{\boldsymbol θ})$. Our proof is based on a Gaussian conditioning technique. Instead of proving that $f$ is approximately linear in a neighborhood of ${\boldsymbol x}$, we characterize the joint distribution of $f({\boldsymbol x};{\boldsymbol θ})$ and $f({\boldsymbol x}';{\boldsymbol θ})$ for ${\boldsymbol x}' = {\boldsymbol x}-s({\boldsymbol x})\nabla_{\boldsymbol x}f({\boldsymbol x};{\boldsymbol θ})$.

STFeb 17, 2022
Universality of empirical risk minimization

Andrea Montanari, Basil Saeed

Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol θ}_1, . . . , {\boldsymbol θ}_{\mathsf k} \in \mathbb{R}^p$ , and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = Θ(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed $-$to leading order$-$ under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).

LGOct 28, 2021
Tractability from overparametrization: The example of the negative perceptron

Andrea Montanari, Yiqiao Zhong, Kangjie Zhou

In the negative perceptron problem we are given $n$ data points $({\boldsymbol x}_i,y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly separable and hence we content ourselves to find a linear classifier with the largest possible \emph{negative} margin. In other words, we want to find a unit norm vector ${\boldsymbol θ}$ that maximizes $\min_{i\le n}y_i\langle {\boldsymbol θ},{\boldsymbol x}_i\rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n,d\to \infty$ with $n/d\toδ$, and prove upper and lower bounds on the maximum margin $κ_{\text{s}}(δ)$ or -- equivalently -- on its inverse function $δ_{\text{s}}(κ)$. In other words, $δ_{\text{s}}(κ)$ is the overparametrization threshold: for $n/d\le δ_{\text{s}}(κ)-\varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d\ge δ_{\text{s}}(κ)+\varepsilon$ it does not. Our bounds on $δ_{\text{s}}(κ)$ match to the leading order as $κ\to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $δ_{\text{lin}}(κ)$. We observe a gap between the interpolation threshold $δ_{\text{s}}(κ)$ and the linear programming threshold $δ_{\text{lin}}(κ)$, raising the question of the behavior of other algorithms.

MLJun 9, 2021
Streaming Belief Propagation for Community Detection

Yuchen Wu, MohammadHossein Bateni, Andre Linhares et al.

The community detection problem requires to cluster the nodes of a network into a small number of well-connected "communities". There has been substantial recent progress in characterizing the fundamental statistical limits of community detection under simple stochastic block models. However, in real-world applications, the network structure is typically dynamic, with nodes that join over time. In this setting, we would like a detection algorithm to perform only a limited number of updates at each node arrival. While standard voting approaches satisfy this constraint, it is unclear whether they exploit the network information optimally. We introduce a simple model for networks growing over time which we refer to as streaming stochastic block model (StSBM). Within this model, we prove that voting algorithms have fundamental limitations. We also develop a streaming belief-propagation (StreamBP) approach, for which we prove optimality in certain regimes. We validate our theoretical findings on synthetic and real data.

LGMar 30, 2021
Minimum complexity interpolation in random features models

Michael Celentano, Theodor Misiakiewicz, Andrea Montanari

Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in $\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). Correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm -- which is equivalent to a weighted $\ell_2$ norm -- is replaced by a weighted functional $\ell_p$ norm, which we refer to as $\mathcal{F}_p$ norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available and minimizing these norms requires to solve an infinite-dimensional convex problem. We study random features approximations to these norms and show that, for $p>1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models. For $p= 1$, our guarantees for the random features approximation break down. We prove instead that learning with the $\mathcal{F}_1$ norm is $\mathsf{NP}$-hard under a randomized reduction based on the problem of learning halfspaces with noise.

STMar 16, 2021
Deep learning: a statistical viewpoint

Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

MLFeb 25, 2021
Learning with invariances in random features and kernel models

Song Mei, Theodor Misiakiewicz, Andrea Montanari

A number of machine learning tasks entail a high degree of invariance: the data distribution does not change if we act on the data with a certain group of transformations. For instance, labels of images are invariant under translations of the images. Certain neural network architectures -- for instance, convolutional networks -- are believed to owe their success to the fact that they exploit such invariance properties. With the objective of quantifying the gain achieved by invariant architectures, we introduce two classes of models: invariant random features and invariant kernel methods. The latter includes, as a special case, the neural tangent kernel for convolutional networks with global average pooling. We consider uniform covariates distributions on the sphere and hypercube and a general invariant target function. We characterize the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, for a class of groups that we call `degeneracy $α$', with $α\leq 1$. We show that exploiting invariance in the architecture saves a $d^α$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency.

STJan 26, 2021
Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration

Song Mei, Theodor Misiakiewicz, Andrea Montanari

Consider the classical supervised learning problem: we are given data $(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model $f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol φ}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a random featurization map ${\boldsymbol φ}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$~What is the generalization error of KRR? $(2)$~How big $N$ should be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\le n^{1-δ}$ for some $δ>0$. We characterize this gap. For $N\ge n^{1+δ}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.

LGNov 6, 2020
Underspecification Presents Challenges for Credibility in Modern Machine Learning

Alexander D'Amour, Katherine Heller, Dan Moldovan et al.

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

STJul 27, 2020
The Lasso with general Gaussian designs with applications to hypothesis testing

Michael Celentano, Andrea Montanari, Yuting Wei

The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates $p$ is of the same order or larger than the number of observations $n$. Classical asymptotic normality theory does not apply to this model due to two fundamental reasons: $(1)$ The regularized risk is non-smooth; $(2)$ The distance between the estimator $\widehat{\boldsymbolθ}$ and the true parameters vector $\boldsymbolθ^*$ cannot be neglected. As a consequence, standard perturbative arguments that are the traditional basis for asymptotic normality fail. On the other hand, the Lasso estimator can be precisely characterized in the regime in which both $n$ and $p$ are large and $n/p$ is of order one. This characterization was first obtained in the case of Gaussian designs with i.i.d. covariates: here we generalize it to Gaussian correlated designs with non-singular covariance structure. This is expressed in terms of a simpler ``fixed-design'' model. We establish non-asymptotic bounds on the distance between the distribution of various quantities in the two models, which hold uniformly over signals $\boldsymbolθ^*$ in a suitable sparsity class and over values of the regularization parameter. As an application, we study the distribution of the debiased Lasso and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals.

MLJul 25, 2020
The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Andrea Montanari, Yiqiao Zhong

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layers neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariates vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).

MLJun 24, 2020
When Do Neural Networks Outperform Kernel Methods?

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz et al.

For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test numerically this hypothesis by showing that specific perturbations of the training distribution degrade the performances of RKHS methods much more significantly than NNs.

MLFeb 28, 2020
The estimation error of general first order methods

Michael Celentano, Andrea Montanari, Yuchen Wu

Modern large-scale statistical models require to estimate thousands to millions of parameters. This is often accomplished by iterative algorithms such as gradient descent, projected gradient descent or their accelerated versions. What are the fundamental limits to these approaches? This question is well understood from an optimization viewpoint when the underlying objective is convex. Work in this area characterizes the gap to global optimality as a function of the number of iterations. However, these results have only indirect implications in terms of the gap to statistical optimality. Here we consider two families of high-dimensional estimation problems: high-dimensional regression and low-rank matrix estimation, and introduce a class of `general first order methods' that aim at efficiently estimating the underlying parameters. This class of algorithms is broad enough to include classical first order optimization (for convex and non-convex objectives), but also other types of algorithms. Under a random design assumption, we derive lower bounds on the estimation error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge. These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms. We illustrate our general results through applications to sparse phase retrieval and sparse principal component analysis.

STJan 24, 2020
Imputation for High-Dimensional Linear Regression

Kabir Aladin Chandrasekher, Ahmed El Alaoui, Andrea Montanari

We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to \emph{impute} the missing entries with an appropriate substitute and then implement a standard statistical procedure acting as if the covariates were fully observed. Recent literature on this subject proposes instead to design a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach where we fill-in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO retains the minimax estimation rate in the random-design setting where the covariates are i.i.d.\ sub-Gaussian. We further show that the square-root LASSO remains \emph{pivotal} in this setting. It is often the case that the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases where the covariates either follow an autoregressive (AR) process, or are jointly Gaussian with sparse precision matrix. We propose tractable estimators for the conditional expectation and then perform linear regression via LASSO, and show similar estimation rates in both cases. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds, but also the broader utility of this strategy beyond our theoretical assumptions.

STNov 5, 2019
The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime

Andrea Montanari, Feng Ruan, Youngtak Sohn et al.

Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that maps the data into linearly separable classes. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data. We consider a stylized setting in which data $(y_i,{\boldsymbol x}_i)$, $i\le n$ are i.i.d. with ${\boldsymbol x}_i\sim\mathsf{N}({\boldsymbol 0},{\boldsymbol Σ})$ a $p$-dimensional Gaussian feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle {\boldsymbol θ}_*,{\boldsymbol x}_i \rangle$. While the Gaussian model might appear extremely simplistic, universality arguments can be used to show that the results derived in this setting also apply to the output of certain nonlinear featurization maps. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to ψ$, and derive exact expressions for the limiting generalization error. We use this theory to derive two results of independent interest: $(i)$ Sufficient conditions on $({\boldsymbol Σ},{\boldsymbol θ}_*)$ for `benign overfitting' that parallel previously derived conditions in the case of linear regression; $(ii)$ An asymptotically exact expression for the generalization error when max-margin classification is used in conjunction with feature vectors produced by random one-layer neural networks.

STAug 14, 2019
The generalization error of random features regression: Precise asymptotics and double descent curve

Song Mei, Andrea Montanari

Deep learning methods operate in regimes that defy the traditional statistical mindset. Neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the test error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the test error is found above the interpolation threshold, often in the extreme overparametrization regime in which the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb S^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol x_i, y_i)\in \mathbb S^{d-1} \times \mathbb R$, $i\le n$. We perform ridge regression on $N$ random features of the form $σ(\boldsymbol w_a^{\mathsf T} \boldsymbol x)$, $a\le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the test error, in the limit $N,n,d\to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.

MLJun 21, 2019
Limitations of Lazy Training of Two-layers Neural Networks

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz et al.

We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.

STApr 27, 2019
Linearized two-layers neural networks in high dimension

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz et al.

We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + δ} \le N\le d^{\ell+1-δ}$ for small $δ> 0$, then \RF\, effectively fits a degree-$\ell$ polynomial in the raw features, and \NT\, fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + δ} \le n \le d^{\ell +1-δ}$, then kernel methods can fit at most a a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

STMar 19, 2019
Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Trevor Hastie, Andrea Montanari, Saharon Rosset et al.

Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in {\mathbb R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = Σ^{1/2} z_i$ (with $z_i \in {\mathbb R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in {\mathbb R}^d$, $W \in {\mathbb R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.

MLFeb 16, 2019
Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

Song Mei, Theodor Misiakiewicz, Andrea Montanari

We consider learning two layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in $R^D$ (where $D$ is the number of parameters associated to each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions), the mean field description is accurate as soon as the number of hidden units is much larger than the dimension $D$. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds. We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of the mean field analysis.

STJan 5, 2019
Analysis of a Two-Layer Neural Network via Displacement Convexity

Adel Javanmard, Marco Mondelli, Andrea Montanari

Fitting a function by using linear combinations of a large number $N$ of `simple' components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches. Here we consider the problem of learning a concave function $f$ on a compact convex domain $Ω\subseteq {\mathbb R}^d$, using linear combinations of `bump-like' components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $Ω$. Further, when the bump width $δ$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates for $N\to\infty$, $δ\to 0$. Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $δ, N$. Explaining this phenomenon, and understanding the dependence on $δ,N$ in a quantitative manner remains an outstanding challenge.

SIJul 23, 2018
Contextual Stochastic Block Models

Yash Deshpande, Andrea Montanari, Elchanan Mossel et al.

We provide the first information theoretic tight analysis for inference of latent community structure given a sparse graph along with high dimensional node covariates, correlated with the same latent communities. Our work bridges recent theoretical breakthroughs in the detection of latent community structure without nodes covariates and a large body of empirical work using diverse heuristics for combining node covariates with graphs for inference. The tightness of our analysis implies in particular, the information theoretical necessity of combining the different sources of information. Our analysis holds for networks of large degrees as well as for a Gaussian version of the model.

MLApr 18, 2018
A Mean Field View of the Landscape of Two-Layers Neural Networks

Song Mei, Andrea Montanari, Phan-Minh Nguyen

Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires to optimize a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties? In this paper we consider a simple case, namely two-layers neural networks, and prove that -in a suitable scaling limit- SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows to 'average-out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.

LGFeb 20, 2018
On the Connection Between Learning Two-Layers Neural Networks and Tensor Decomposition

Marco Mondelli, Andrea Montanari

We establish connections between the problem of learning a two-layer neural network and tensor decomposition. We consider a model with feature vectors $\boldsymbol x \in \mathbb R^d$, $r$ hidden units with weights $\{\boldsymbol w_i\}_{1\le i \le r}$ and output $y\in \mathbb R$, i.e., $y=\sum_{i=1}^r σ( \boldsymbol w_i^{\mathsf T}\boldsymbol x)$, with activation functions given by low-degree polynomials. In particular, if $σ(x) = a_0+a_1x+a_3x^3$, we prove that no polynomial-time learning algorithm can outperform the trivial predictor that assigns to each example the response variable $\mathbb E(y)$, when $d^{3/2}\ll r\ll d^2$. Our conclusion holds for a `natural data distribution', namely standard Gaussian feature vectors $\boldsymbol x$, and output distributed according to a two-layer neural network with random isotropic weights, and under a certain complexity-theoretic assumption on tensor decomposition. Roughly speaking, we assume that no polynomial-time algorithm can substantially outperform current methods for tensor decomposition based on the sum-of-squares hierarchy. We also prove generalizations of this statement for higher degree polynomial activations, and non-random weight vectors. Remarkably, several existing algorithms for learning two-layer networks with rigorous guarantees are based on tensor decomposition. Our results support the idea that this is indeed the core computational difficulty in learning such networks, under the stated generative model for the data. As a side result, we show that under this model learning the network requires accurate learning of its weights, a property that does not hold in a more general setting.

MLFeb 2, 2018
An Instability in Variational Inference for Topic Models

Behrooz Ghorbani, Hamid Javadi, Andrea Montanari

Topic models are Bayesian models that are frequently used to capture the latent structure of certain corpora of documents or images. Each data element in such a corpus (for instance each item in a collection of scientific articles) is regarded as a convex combination of a small number of vectors corresponding to `topics' or `components'. The weights are assumed to have a Dirichlet prior distribution. The standard approach towards approximating the posterior is to use variational inference algorithms, and in particular a mean field approximation. We show that this approach suffers from an instability that can produce misleading conclusions. Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics. However --for the same parameter values-- the data contain no actual information about the true decomposition, and hence the output of the algorithm is uncorrelated with the true topic decomposition. Among other consequences, the estimated posterior mean is significantly wrong, and estimated Bayesian credible regions do not achieve the nominal coverage. We discuss how this instability is remedied by more accurate mean field approximations.

STNov 15, 2017
The landscape of the spiked tensor model

Gerard Ben Arous, Song Mei, Andrea Montanari et al.

We consider the problem of estimating a large rank-one tensor ${\boldsymbol u}^{\otimes k}\in({\mathbb R}^{n})^{\otimes k}$, $k\ge 3$ in Gaussian noise. Earlier work characterized a critical signal-to-noise ratio $λ_{Bayes}= O(1)$ above which an ideal estimator achieves strictly positive correlation with the unknown vector of interest. Remarkably no polynomial-time algorithm is known that achieved this goal unless $λ\ge C n^{(k-2)/4}$ and even powerful semidefinite programming relaxations appear to fail for $1\ll λ\ll n^{(k-2)/4}$. In order to elucidate this behavior, we consider the maximum likelihood estimator, which requires maximizing a degree-$k$ homogeneous polynomial over the unit sphere in $n$ dimensions. We compute the expected number of critical points and local maxima of this objective function and show that it is exponential in the dimensions $n$, and give exact formulas for the exponential growth rate. We show that (for $λ$ larger than a constant) critical points are either very close to the unknown vector ${\boldsymbol u}$, or are confined in a band of width $Θ(λ^{-1/(k-1)})$ around the maximum circle that is orthogonal to ${\boldsymbol u}$. For local maxima, this band shrinks to be of size $Θ(λ^{-1/(k-2)})$. These `uninformative' local maxima are likely to cause the failure of optimization algorithms.

STNov 6, 2017
Estimation of Low-Rank Matrices via Approximate Message Passing

Andrea Montanari, Ramji Venkataramanan

Consider the problem of estimating a low-rank matrix when its entries are perturbed by Gaussian noise. If the empirical distribution of the entries of the spikes is known, optimal estimators that exploit this knowledge can substantially outperform simple spectral approaches. Recent work characterizes the asymptotic accuracy of Bayes-optimal estimators in the high-dimensional limit. In this paper we present a practical algorithm that can achieve Bayes-optimal accuracy above the spectral threshold. A bold conjecture from statistical physics posits that no polynomial-time algorithm achieves optimal error below the same threshold (unless the best estimator is trivial). Our approach uses Approximate Message Passing (AMP) in conjunction with a spectral initialization. AMP algorithms have proved successful in a variety of statistical estimation tasks, and are amenable to exact asymptotic analysis via state evolution. Unfortunately, state evolution is uninformative when the algorithm is initialized near an unstable fixed point, as often happens in low-rank matrix estimation. We develop a new analysis of AMP that allows for spectral initializations. Our main theorem is general and applies beyond matrix estimation. However, we use it to derive detailed predictions for the problem of estimating a rank-one matrix in noise. Special cases of this problem are closely related---via universality arguments---to the network community detection problem for two asymmetric communities. For general rank-one models, we show that AMP can be used to construct confidence intervals and control false discovery rate. We provide illustrations of the general methodology by considering the cases of sparse low-rank matrices and of block-constant low-rank matrices with symmetric blocks (we refer to the latter as to the `Gaussian Block Model').

MLSep 19, 2017
Inference in Graphical Models via Semidefinite Programming Hierarchies

Murat A. Erdogdu, Yash Deshpande, Andrea Montanari

Maximum A posteriori Probability (MAP) inference in graphical models amounts to solving a graph-structured combinatorial optimization problem. Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with $n$ vertices require solving an SDP with $n^{Θ(d)}$ variables where $d$ is the degree in the hierarchy. In practice, for $d\ge 4$, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.

MLAug 22, 2017
Learning Combinations of Sigmoids Through Gradient Estimation

Stratis Ioannidis, Andrea Montanari

We develop a new approach to learn the parameters of regression models with hidden variables. In a nutshell, we estimate the gradient of the regression function at a set of random points, and cluster the estimated gradients. The centers of the clusters are used as estimates for the parameters of hidden units. We justify this approach by studying a toy model, whereby the regression function is a linear combination of sigmoids. We prove that indeed the estimated gradients concentrate around the parameter vectors of the hidden units, and provide non-asymptotic bounds on the number of required samples. To the best of our knowledge, no comparable guarantees have been proven for linear combinations of sigmoids.

MLAug 20, 2017
Fundamental Limits of Weak Recovery with Applications to Phase Retrieval

Marco Mondelli, Andrea Montanari

In phase retrieval we want to recover an unknown signal $\boldsymbol x\in\mathbb C^d$ from $n$ quadratic measurements of the form $y_i = |\langle{\boldsymbol a}_i,{\boldsymbol x}\rangle|^2+w_i$ where $\boldsymbol a_i\in \mathbb C^d$ are known sensing vectors and $w_i$ is measurement noise. We ask the following weak recovery question: what is the minimum number of measurements $n$ needed to produce an estimator $\hat{\boldsymbol x}(\boldsymbol y)$ that is positively correlated with the signal $\boldsymbol x$? We consider the case of Gaussian vectors $\boldsymbol a_i$. We prove that - in the high-dimensional limit - a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $n\le d-o(d)$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $n\ge d+o(d)$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well. Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements $y_i$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.

MLMay 8, 2017
Non-negative Matrix Factorization via Archetypal Analysis

Hamid Javadi, Andrea Montanari

Given a collection of data points, non-negative matrix factorization (NMF) suggests to express them as convex combinations of a small set of `archetypes' with non-negative entries. This decomposition is unique only if the true archetypes are non-negative and sufficiently sparse (or the weights are sufficiently sparse), a regime that is captured by the separability condition and its generalizations. In this paper, we study an approach to NMF that can be traced back to the work of Cutler and Breiman (1994) and does not require the data to be separable, while providing a generally unique decomposition. We optimize the trade-off between two objectives: we minimize the distance of the data points from the convex envelope of the archetypes (which can be interpreted as an empirical risk), while minimizing the distance of the archetypes from the convex envelope of the data (which can be interpreted as a data-dependent regularization). The archetypal analysis method of (Cutler, Breiman, 1994) is recovered as the limiting case in which the last term is given infinite weight. We introduce a `uniqueness condition' on the data which is necessary for exactly recovering the archetypes from noiseless data. We prove that, under uniqueness (plus additional regularity conditions on the geometry of the archetypes), our estimator is robust. While our approach requires solving a non-convex optimization problem, we find that standard optimization methods succeed in finding good solutions both for real and synthetic data.

OCMar 25, 2017
Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality

Song Mei, Theodor Misiakiewicz, Andrea Montanari et al.

A number of statistical estimation problems can be addressed by semidefinite programs (SDP). While SDPs are solvable in polynomial time using interior point methods, in practice generic SDP solvers do not scale well to high-dimensional problems. In order to cope with this problem, Burer and Monteiro proposed a non-convex rank-constrained formulation, which has good performance in practice but is still poorly understood theoretically. In this paper we study the rank-constrained version of SDPs arising in MaxCut and in synchronization problems. We establish a Grothendieck-type inequality that proves that all the local maxima and dangerous saddle points are within a small multiplicative gap from the global maximum. We use this structural information to prove that SDPs can be solved within a known accuracy, by applying the Riemannian trust-region method to this non-convex problem, while constraining the rank to be of order one. For the MaxCut problem, our inequality implies that any local maximizer of the rank-constrained SDP provides a $ (1 - 1/(k-1)) \times 0.878$ approximation of the MaxCut, when the rank is fixed to $k$. We then apply our results to data matrices generated according to the Gaussian ${\mathbb Z}_2$ synchronization problem, and the two-groups stochastic block model with large bounded degree. We prove that the error achieved by local maximizers undergoes a phase transition at the same threshold as for information-theoretically optimal methods.

DSDec 23, 2016
Spectral algorithms for tensor completion

Andrea Montanari, Nike Sun

In the tensor completion problem, one seeks to estimate a low-rank tensor based on a random sample of revealed entries. In terms of the required sample size, earlier work revealed a large gap between estimation with unbounded computational resources (using, for instance, tensor nuclear norm minimization) and polynomial-time algorithms. Among the latter, the best statistical guarantees have been proved, for third-order tensors, using the sixth level of the sum-of-squares (SOS) semidefinite programming hierarchy (Barak and Moitra, 2014). However, the SOS approach does not scale well to large problem instances. By contrast, spectral methods --- based on unfolding or matricizing the tensor --- are attractive for their low complexity, but have been believed to require a much larger sample size. This paper presents two main contributions. First, we propose a new unfolding-based method, which outperforms naive ones for symmetric $k$-th order tensors of rank $r$. For this result we make a study of singular space estimation for partially revealed matrices of large aspect ratio, which may be of independent interest. For third-order tensors, our algorithm matches the SOS method in terms of sample size (requiring about $rd^{3/2}$ revealed entries), subject to a worse rank condition ($r\ll d^{3/4}$ rather than $r\ll d^{3/2}$). We complement this result with a different spectral algorithm for third-order tensors in the overcomplete ($r\ge d$) regime. Under a random model, this second approach succeeds in estimating tensors of rank $d\le r \ll d^{3/2}$ from about $rd^{3/2}$ revealed entries.

DMOct 17, 2016
How Well Do Local Algorithms Solve Semidefinite Programs?

Zhou Fan, Andrea Montanari

Several probabilistic models from high-dimensional statistics and machine learning reveal an intriguing --and yet poorly understood-- dichotomy. Either simple local algorithms succeed in estimating the object of interest, or even sophisticated semi-definite programming (SDP) relaxations fail. In order to explore this phenomenon, we study a classical SDP relaxation of the minimum graph bisection problem, when applied to Erdős-Renyi random graphs with bounded average degree $d>1$, and obtain several types of results. First, we use a dual witness construction (using the so-called non-backtracking matrix of the graph) to upper bound the SDP value. Second, we prove that a simple local algorithm approximately solves the SDP to within a factor $2d^2/(2d^2+d-1)$ of the upper bound. In particular, the local algorithm is at most $8/9$ suboptimal, and $1+O(1/d)$ suboptimal for large degree. We then analyze a more sophisticated local algorithm, which aggregates information according to the harmonic measure on the limiting Galton-Watson (GW) tree. The resulting lower bound is expressed in terms of the conductance of the GW tree and matches surprisingly well the empirically determined SDP values on large-scale Erdős-Renyi graphs. We finally consider the planted partition model. In this case, purely local algorithms are known to fail, but they do succeed if a small amount of side information is available. Our results imply quantitative bounds on the threshold for partial recovery using SDP in this model.