MLMar 3, 2023
Asymptotic Bayes risk of semi-supervised multitask learning on Gaussian mixtureMinh-Toan Nguyen, Romain Couillet
The article considers semi-supervised multitask learning on a Gaussian mixture model (GMM). Using methods from statistical physics, we compute the asymptotic Bayes risk of each task in the regime of large datasets in high dimension, from which we analyze the role of task similarity in learning and evaluate the performance gain when tasks are learned together rather than separately. In the supervised case, we derive a simple algorithm that attains the Bayes optimal performance.
MLFeb 5, 2024
A Random Matrix Approach to Low-Multilinear-Rank Tensor ApproximationHugo Lebeau, Florent Chatelain, Romain Couillet
This work presents a comprehensive understanding of the estimation of a planted low-rank signal from a general spiked tensor model near the computational threshold. Relying on standard tools from the theory of large random matrices, we characterize the large-dimensional spectral behavior of the unfoldings of the data tensor and exhibit relevant signal-to-noise ratios governing the detectability of the principal directions of the signal. These results allow to accurately predict the reconstruction performance of truncated multilinear SVD (MLSVD) in the non-trivial regime. This is particularly important since it serves as an initialization of the higher-order orthogonal iteration (HOOI) scheme, whose convergence to the best low-multilinear-rank approximation depends entirely on its initialization. We give a sufficient condition for the convergence of HOOI and show that the number of iterations before convergence tends to $1$ in the large-dimensional limit.
MLFeb 21, 2024
A Large Dimensional Analysis of Multi-task Semi-Supervised LearningVictor Leger, Romain Couillet
This article conducts a large dimensional study of a simple yet quite versatile classification model, encompassing at once multi-task and semi-supervised learning, and taking into account uncertain labeling. Using tools from random matrix theory, we characterize the asymptotics of some key functionals, which allows us on the one hand to predict the performances of the algorithm, and on the other hand to reveal some counter-intuitive guidance on how to use it efficiently. The model, powerful enough to provide good performance guarantees, is also straightforward enough to provide strong insights into its behavior.
MLFeb 19, 2024
Asymptotic Gaussian Fluctuations of Eigenvectors in Spectral ClusteringHugo Lebeau, Florent Chatelain, Romain Couillet
The performance of spectral clustering relies on the fluctuations of the entries of the eigenvectors of a similarity matrix, which has been left uncharacterized until now. In this letter, it is shown that the signal $+$ noise structure of a general spike random matrix model is transferred to the eigenvectors of the corresponding Gram kernel matrix and the fluctuations of their entries are Gaussian in the large-dimensional regime. This CLT-like result was the last missing piece to precisely predict the classification performance of spectral clustering. The proposed proof is very general and relies solely on the rotational invariance of the noise. Numerical experiments on synthetic and real data illustrate the universality of this phenomenon.
MLMar 26, 2024
Asymptotic Bayes risk of semi-supervised learning with uncertain labelingVictor Leger, Romain Couillet
This article considers a semi-supervised classification setting on a Gaussian mixture model, where the data is not labeled strictly as usual, but instead with uncertain labels. Our main aim is to compute the Bayes risk for this model. We compare the behavior of the Bayes risk and the best known algorithm for this model. This comparison eventually gives new insights over the algorithm.
PRDec 23, 2021
When Random Tensors meet Random MatricesMohamed El Amine Seddik, Maxime Guillaud, Romain Couillet
Relying on random matrix theory (RMT), this paper studies asymmetric order-$d$ spiked tensor models with Gaussian noise. Using the variational definition of the singular vectors and values of (Lim, 2005), we show that the analysis of the considered model boils down to the analysis of an equivalent spiked symmetric \textit{block-wise} random matrix, that is constructed from \textit{contractions} of the studied tensor with the singular vectors associated to its best rank-1 approximation. Our approach allows the exact characterization of the almost sure asymptotic singular value and alignments of the corresponding singular vectors with the true spike components, when $\frac{n_i}{\sum_{j=1}^d n_j}\to c_i\in (0, 1)$ with $n_i$'s the tensor dimensions. In contrast to other works that rely mostly on tools from statistical physics to study random tensors, our results rely solely on classical RMT tools such as Stein's lemma. Finally, classical RMT results concerning spiked random matrices are recovered as a particular case.
MLNov 1, 2021
PCA-based Multi Task Learning: a Random Matrix ApproachMalik Tiomoko, Romain Couillet, Frédéric Pascal
The article proposes and theoretically analyses a \emph{computationally efficient} multi-task learning (MTL) extension of popular principal component analysis (PCA)-based supervised learning schemes \cite{barshan2011supervised,bair2006prediction}. The analysis reveals that (i) by default learning may dramatically fail by suffering from \emph{negative transfer}, but that (ii) simple counter-measures on data labels avert negative transfer and necessarily result in improved performances. Supporting experiments on synthetic and real data benchmarks show that the proposed method achieves comparable performance with state-of-the-art MTL methods but at a \emph{significantly reduced computational cost}.
LGOct 9, 2021
Multi-task learning on the edge: cost-efficiency and theoretical optimalitySami Fakhry, Romain Couillet, Malik Tiomoko
This article proposes a distributed multi-task learning (MTL) algorithm based on supervised principal component analysis (SPCA) which is: (i) theoretically optimal for Gaussian mixtures, (ii) computationally cheap and scalable. Supporting experiments on synthetic and real benchmark data demonstrate that significant energy gains can be obtained with no performance loss.
MLOct 5, 2021
Random matrices in service of ML footprint: ternary random features with no performance lossHafiz Tiomoko Ali, Zhenyu Liao, Romain Couillet
In this article, we investigate the spectral behavior of random features kernel matrices of the type ${\bf K} = \mathbb{E}_{\bf w} \left[σ\left({\bf w}^{\sf T}{\bf x}_i\right)σ\left({\bf w}^{\sf T}{\bf x}_j\right)\right]_{i,j=1}^n$, with nonlinear function $σ(\cdot)$, data ${\bf x}_1, \ldots, {\bf x}_n \in \mathbb{R}^p$, and random projection vector ${\bf w} \in \mathbb{R}^p$ having i.i.d. entries. In a high-dimensional setting where the number of data $n$ and their dimension $p$ are both large and comparable, we show, under a Gaussian mixture model for the data, that the eigenspectrum of ${\bf K}$ is independent of the distribution of the i.i.d.(zero-mean and unit-variance) entries of ${\bf w}$, and only depends on $σ(\cdot)$ via its (generalized) Gaussian moments $\mathbb{E}_{z\sim \mathcal N(0,1)}[σ'(z)]$ and $\mathbb{E}_{z\sim \mathcal N(0,1)}[σ''(z)]$. As a result, for any kernel matrix ${\bf K}$ of the form above, we propose a novel random features technique, called Ternary Random Feature (TRF), that (i) asymptotically yields the same limiting kernel as the original ${\bf K}$ in a spectral sense and (ii) can be computed and stored much more efficiently, by wisely tuning (in a data-dependent manner) the function $σ$ and the random vector ${\bf w}$, both taking values in $\{-1,0,1\}$. The computation of the proposed random features requires no multiplication, and a factor of $b$ times less bits for storage compared to classical random features such as random Fourier features, with $b$ the number of bits to store full precision values. Besides, it appears in our experiments on real data that the substantial gains in computation and storage are accompanied with somewhat improved performances compared to state-of-the-art random features compression/quantization methods.
PRSep 6, 2021
Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columnsCosme Louart, Romain Couillet
Given a random matrix $X= (x_1,\ldots, x_n)\in \mathcal M_{p,n}$ with independent columns and satisfying concentration of measure hypotheses and a parameter $z$ whose distance to the spectrum of $\frac{1}{n} XX^T$ should not depend on $p,n$, it was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T- zI_p)^{-1}$ and $A\in \mathcal M_{p}$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt n)$. Here, we show that $\|\mathbb E[R(z)] - \tilde R(z)\|_F \leq O(1/\sqrt n)$, where $\tilde R(z)$ is a deterministic matrix depending only on $z$ and on the means and covariances of the column vectors $x_1,\ldots, x_n$ (that do not have to be identically distributed). This estimation is key to providing accurate fluctuation rates of functionals of $X$ of interest (mostly related to its spectral properties) and is proved thanks to the introduction of a semi-metric $d_s$ defined on the set $\mathcal D_n(\mathbb H)$ of diagonal matrices with complex entries and positive imaginary part and satisfying, for all $D,D' \in \mathcal D_n(\mathbb H)$: $d_s(D,D') = \max_{i\in[n]} |D_i - D_i'|/ (\Im(D_i) \Im(D_i'))^{1/2}$. Possibly most importantly, the underlying concentration of measure assumption on the columns of $X$ finds an extremely natural ground for application in modern statistical machine learning algorithms where non-linear Lipschitz mappings and high number of classes form the base ingredients.
MLAug 2, 2021
A Random Matrix Perspective on Random TensorsJosé Henrique de Morais Goulart, Romain Couillet, Pierre Comon
Tensor models play an increasingly prominent role in many fields, notably in machine learning. In several applications, such as community detection, topic modeling and Gaussian mixture learning, one must estimate a low-rank signal from a noisy tensor. Hence, understanding the fundamental limits of estimators of that signal inevitably calls for the study of random tensors. Substantial progress has been recently achieved on this subject in the large-dimensional limit. Yet, some of the most significant among these results--in particular, a precise characterization of the abrupt phase transition (with respect to signal-to-noise ratio) that governs the performance of the maximum likelihood (ML) estimator of a symmetric rank-one model with Gaussian noise--were derived based of mean-field spin glass theory, which is not easily accessible to non-experts. In this work, we develop a sharply distinct and more elementary approach, relying on standard but powerful tools brought by years of advances in random matrix theory. The key idea is to study the spectra of random matrices arising from contractions of a given random tensor. We show how this gives access to spectral properties of the random tensor itself. For the aforementioned rank-one model, our technique yields a hitherto unknown fixed-point equation whose solution precisely matches the asymptotic performance of the ML estimator above the phase transition threshold in the third-order case. A numerical verification provides evidence that the same holds for orders 4 and 5, leading us to conjecture that, for any order, our fixed-point equation is equivalent to the known characterization of the ML estimation performance that had been obtained by relying on spin glasses. Moreover, our approach sheds light on certain properties of the ML problem landscape in large dimensions and can be extended to other models, such as asymmetric and non-Gaussian.
MLMar 5, 2021
Nishimori meets Bethe: a spectral method for node classification in sparse weighted graphsLorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay
This article unveils a new relation between the Nishimori temperature parametrizing a distribution P and the Bethe free energy on random Erdos-Renyi graphs with edge weights distributed according to P. Estimating the Nishimori temperature being a task of major importance in Bayesian inference problems, as a practical corollary of this new relation, a numerical method is proposed to accurately estimate the Nishimori temperature from the eigenvalues of the Bethe Hessian matrix of the weighted graph. The algorithm, in turn, is used to propose a new spectral method for node classification in weighted (possibly sparse) graphs. The superiority of the method over competing state-of-the-art approaches is demonstrated both through theoretical arguments and real-world data experiments.
LGFeb 24, 2021
Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clusteringRomain Couillet, Florent Chatelain, Nicolas Le Bihan
The article introduces an elementary cost and storage reduction method for spectral clustering and principal component analysis. The method consists in randomly "puncturing" both the data matrix $X\in\mathbb{C}^{p\times n}$ (or $\mathbb{R}^{p\times n}$) and its corresponding kernel (Gram) matrix $K$ through Bernoulli masks: $S\in\{0,1\}^{p\times n}$ for $X$ and $B\in\{0,1\}^{n\times n}$ for $K$. The resulting "two-way punctured" kernel is thus given by $K=\frac{1}{p}[(X \odot S)^{\sf H} (X \odot S)] \odot B$. We demonstrate that, for $X$ composed of independent columns drawn from a Gaussian mixture model, as $n,p\to\infty$ with $p/n\to c_0\in(0,\infty)$, the spectral behavior of $K$ -- its limiting eigenvalue distribution, as well as its isolated eigenvalues and eigenvectors -- is fully tractable and exhibits a series of counter-intuitive phenomena. We notably prove, and empirically confirm on GAN-generated image databases, that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains, for a virtually constant (clustering of PCA) performance. This preliminary study opens as such the path towards rethinking, from a large dimensional standpoint, computational and storage costs in elementary machine learning models.
PRFeb 16, 2021
Concentration of measure and generalized product of random vectors with an application to Hanson-Wright-like inequalitiesCosme Louart, Romain Couillet
Starting from concentration of measure hypotheses on $m$ random vectors $Z_1,\ldots, Z_m$, this article provides an expression of the concentration of functionals $φ(Z_1,\ldots, Z_m)$ where the variations of $φ$ on each variable depend on the product of the norms (or semi-norms) of the other variables (as if $φ$ were a product). We illustrate the importance of this result through various generalizations of the Hanson-Wright concentration inequality as well as through a study of the random matrix $XDX^T$ and its resolvent $Q = (I_p - \frac{1}{n}XDX^T)^{-1}$, where $X$ and $D$ are random, which have fundamental interest in statistical machine learning applications.
MLOct 3, 2020
Sparse Quantized Spectral ClusteringZhenyu Liao, Romain Couillet, Michael W. Mahoney
Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs with very aggressively sparsified or quantized spectral clustering. We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors.
MLSep 3, 2020
Large Dimensional Analysis and Improvement of Multi Task LearningMalik Tiomoko, Romain Couillet, Hafiz Tiomoko
Multi Task Learning (MTL) efficiently leverages useful information contained in multiple related tasks to help improve the generalization performance of all tasks. This article conducts a large dimensional analysis of a simple but, as we shall see, extremely powerful when carefully tuned, Least Square Support Vector Machine (LSSVM) version of MTL, in the regime where the dimension $p$ of the data and their number $n$ grow large at the same rate. Under mild assumptions on the input data, the theoretical analysis of the MTL-LSSVM algorithm first reveals the "sufficient statistics" exploited by the algorithm and their interaction at work. These results demonstrate, as a striking consequence, that the standard approach to MTL-LSSVM is largely suboptimal, can lead to severe effects of negative transfer but that these impairments are easily corrected. These corrections are turned into an improved MTL-LSSVM algorithm which can only benefit from additional data, and the theoretical performance of which is also analyzed. As evidenced and theoretically sustained in numerous recent works, these large dimensional results are robust to broad ranges of data distributions, which our present experiments corroborate. Specifically, the article reports a systematically close behavior between theoretical and empirical performances on popular datasets, which is strongly suggestive of the applicability of the proposed carefully tuned MTL-LSSVM method to real data. This fine-tuning is fully based on the theoretical analysis and does not in particular require any cross validation procedure. Besides, the reported performances on real datasets almost systematically outperform much more elaborate and less intuitive state-of-the-art multi-task and transfer learning methods.
PRJun 17, 2020
A Concentration of Measure and Random Matrix Approach to Large Dimensional Robust StatisticsCosme Louart, Romain Couillet
This article studies the \emph{robust covariance matrix estimation} of a data collection $X = (x_1,\ldots,x_n)$ with $x_i = \sqrt τ_i z_i + m$, where $z_i \in \mathbb R^p$ is a \textit{concentrated vector} (e.g., an elliptical random vector), $m\in \mathbb R^p$ a deterministic signal and $τ_i\in \mathbb R$ a scalar perturbation of possibly large amplitude, under the assumption where both $n$ and $p$ are large. This estimator is defined as the fixed point of a function which we show is contracting for a so-called \textit{stable semi-metric}. We exploit this semi-metric along with concentration of measure arguments to prove the existence and uniqueness of the robust estimator as well as evaluate its limiting spectral distribution.
LGJun 13, 2020
Consistent Semi-Supervised Graph Regularization for High Dimensional DataXiaoyi Mai, Romain Couillet
Semi-supervised Laplacian regularization, a standard graph-based approach for learning from both labelled and unlabelled data, was recently demonstrated to have an insignificant high dimensional learning efficiency with respect to unlabelled data (Mai and Couillet 2018), causing it to be outperformed by its unsupervised counterpart, spectral clustering, given sufficient unlabelled data. Following a detailed discussion on the origin of this inconsistency problem, a novel regularization approach involving centering operation is proposed as solution, supported by both theoretical analysis and empirical results.
MLJun 9, 2020
A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double DescentZhenyu Liao, Romain Couillet, Michael W. Mahoney
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the dimension of feature space $N$ are all large and comparable. In this regime, the random RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when $N \to \infty$ alone), but it still has a tractable behavior that is captured by our analysis. This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$. Based on these estimates, a precise characterization of two qualitatively different phases of learning, including the phase transition between them, is provided; and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not depend on strong assumptions on the data distribution, and they perfectly match empirical results on real-world data sets.
SIJun 3, 2020
Community detection in sparse time-evolving graphs with a dynamical Bethe-HessianLorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay
This article considers the problem of community detection in sparse dynamical graphs in which the community structure evolves over time. A fast spectral algorithm based on an extension of the Bethe-Hessian matrix is proposed, which benefits from the positive correlation in the class labels and in their temporal evolution and is designed to be applicable to any dynamical graph with a community structure. Under the dynamical degree-corrected stochastic block model, in the case of two classes of equal size, we demonstrate and support with extensive simulations that our proposed algorithm is capable of making non-trivial community reconstruction as soon as theoretically possible, thereby reaching the optimal detectability threshold and provably outperforming competing spectral methods.
MLMar 20, 2020
A unified framework for spectral clustering in sparse graphsLorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay
This article considers spectral community detection in the regime of sparse networks with heterogeneous degree distributions, for which we devise an algorithm to efficiently retrieve communities. Specifically, we demonstrate that a conveniently parametrized form of regularized Laplacian matrix can be used to perform spectral clustering in sparse networks, without suffering from its degree heterogeneity. Besides, we exhibit important connections between this proposed matrix and the now popular non-backtracking matrix, the Bethe-Hessian matrix, as well as the standard Laplacian matrix. Interestingly, as opposed to competitive methods, our proposed improved parametrization inherently accounts for the hardness of the classification problem. These findings are summarized under the form of an algorithm capable of both estimating the number of communities and achieving high-quality community reconstruction.
LGJan 21, 2020
Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian MixturesMohamed El Amine Seddik, Cosme Louart, Mohamed Tamaazousti et al.
This paper shows that deep learning (DL) representations of data produced by generative adversarial nets (GANs) are random vectors which fall within the class of so-called \textit{concentrated} random vectors. Further exploiting the fact that Gram matrices, of the type $G = X^T X$ with $X=[x_1,\ldots,x_n]\in \mathbb{R}^{p\times n}$ and $x_i$ independent concentrated random vectors from a mixture model, behave asymptotically (as $n,p\to \infty$) as if the $x_i$ were drawn from a Gaussian mixture, suggests that DL representations of GAN-data can be fully described by their first two statistical moments for a wide range of standard classifiers. Our theoretical findings are validated by generating images with the BigGAN model and across different popular deep representation networks.
LGDec 3, 2019
Optimal Laplacian regularization for sparse spectral community detectionLorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay
Regularization of the classical Laplacian matrices was empirically shown to improve spectral clustering in sparse networks. It was observed that small regularizations are preferable, but this point was left as a heuristic argument. In this paper we formally determine a proper regularization which is intimately related to alternative state-of-the-art spectral techniques for sparse graphs.
MLSep 15, 2019
Inner-product Kernels are Asymptotically Equivalent to Binary Discrete KernelsZhenyu Liao, Romain Couillet
This article investigates the eigenspectrum of the inner product-type kernel matrix $\sqrt{p} \mathbf{K}=\{f( \mathbf{x}_i^{\sf T} \mathbf{x}_j/\sqrt{p})\}_{i,j=1}^n $ under a binary mixture model in the high dimensional regime where the number of data $n$ and their dimension $p$ are both large and comparable. Based on recent advances in random matrix theory, we show that, for a wide range of nonlinear functions $f$, the eigenspectrum behavior is asymptotically equivalent to that of an (at most) cubic function. This sheds new light on the understanding of nonlinearity in large dimensional problems. As a byproduct, we propose a simple function prototype valued in $ (-1,0,1) $ that, while reducing both storage memory and running time, achieves the same (asymptotic) classification performance as any arbitrary function $f$.
MLMar 8, 2019
Random Matrix-Improved Estimation of the Wasserstein Distance between two Centered Gaussian DistributionsMalik Tiomoko, Romain Couillet
This article proposes a method to consistently estimate functionals $\frac1p\sum_{i=1}^pf(λ_i(C_1C_2))$ of the eigenvalues of the product of two covariance matrices $C_1,C_2\in\mathbb{R}^{p\times p}$ based on the empirical estimates $λ_i(\hat C_1\hat C_2)$ ($\hat C_a=\frac1{n_a}\sum_{i=1}^{n_a} x_i^{(a)}x_i^{(a){\sf T}}$), when the size $p$ and number $n_a$ of the (zero mean) samples $x_i^{(a)}$ are similar. As a corollary, a consistent estimate of the Wasserstein distance (related to the case $f(t)=\sqrt{t}$) between centered Gaussian distributions is derived. The new estimate is shown to largely outperform the classical sample covariance-based `plug-in' estimator. Based on this finding, a practical application to covariance estimation is then devised which demonstrates potentially significant performance gains with respect to state-of-the-art alternatives.
MLFeb 7, 2019
Random Matrix Improved Covariance Estimation for a Large Class of MetricsMalik Tiomoko, Florent Bouchard, Guillaume Ginholac et al.
Relying on recent advances in statistical estimation of covariance distances based on random matrix theory, this article proposes an improved covariance and precision matrix estimation for a wide family of metrics. The method is shown to largely outperform the sample covariance matrix estimate and to compete with state-of-the-art methods, while at the same time being computationally simpler. Applications to linear and quadratic discriminant analyses also demonstrate significant gains, therefore suggesting practical interest to statistical machine learning.
SIJan 25, 2019
Revisiting the Bethe-Hessian: Improved Community Detection in Sparse Heterogeneous GraphsLorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay
Spectral clustering is one of the most popular, yet still incompletely understood, methods for community detection on graphs. This article studies spectral clustering based on the Bethe-Hessian matrix $H_r = (r^2-1)I_n + D-rA$ for sparse heterogeneous graphs (following the degree-corrected stochastic block model) in a two-class setting. For a specific value $r = ζ$, clustering is shown to be insensitive to the degree heterogeneity. We then study the behavior of the informative eigenvector of $H_ζ$ and, as a result, predict the clustering accuracy. The article concludes with an overview of the generalization to more than two classes along with extensive simulations on synthetic and real networks corroborating our findings.
LGNov 8, 2018
A Geometric Approach of Gradient Descent Algorithms in Linear Neural NetworksYacine Chitour, Zhenyu Liao, Romain Couillet
In this paper, we propose a geometric framework to analyze the convergence properties of gradient descent trajectories in the context of linear neural networks. We translate a well-known empirical observation of linear neural nets into a conjecture that we call the \emph{overfitting conjecture} which states that, for almost all training data and initial conditions, the trajectory of the corresponding gradient descent system converges to a global minimum. This would imply that the solution achieved by vanilla gradient descent algorithms is equivalent to that of the least-squares estimation, for linear neural networks of an arbitrary number of hidden layers. Built upon a key invariance property induced by the network structure, we first establish convergence of gradient descent trajectories to critical points of the square loss function in the case of linear networks of arbitrary depth. Our second result is the proof of the \emph{overfitting conjecture} in the case of single-hidden-layer linear networks with an argument based on the notion of normal hyperbolicity and under a generic property on the training data (i.e., holding for almost all training data).
PROct 10, 2018
Random matrix-improved estimation of covariance matrix distancesRomain Couillet, Malik Tiomoko, Steeve Zozor et al.
Given two sets $x_1^{(1)},\ldots,x_{n_1}^{(1)}$ and $x_1^{(2)},\ldots,x_{n_2}^{(2)}\in\mathbb{R}^p$ (or $\mathbb{C}^p$) of random vectors with zero mean and positive definite covariance matrices $C_1$ and $C_2\in\mathbb{R}^{p\times p}$ (or $\mathbb{C}^{p\times p}$), respectively, this article provides novel estimators for a wide range of distances between $C_1$ and $C_2$ (along with divergences between some zero mean and covariance $C_1$ or $C_2$ probability measures) of the form $\frac1p\sum_{i=1}^n f(λ_i(C_1^{-1}C_2))$ (with $λ_i(X)$ the eigenvalues of matrix $X$). These estimators are derived using recent advances in the field of random matrix theory and are asymptotically consistent as $n_1,n_2,p\to\infty$ with non trivial ratios $p/n_1<1$ and $p/n_2<1$ (the case $p/n_2>1$ is also discussed). A first "generic" estimator, valid for a large set of $f$ functions, is provided under the form of a complex integral. Then, for a selected set of $f$'s of practical interest (namely, $f(t)=t$, $f(t)=\log(t)$, $f(t)=\log(1+st)$ and $f(t)=\log^2(t)$), a closed-form expression is provided. Beside theoretical findings, simulation results suggest an outstanding performance advantage for the proposed estimators when compared to the classical "plug-in" estimator $\frac1p\sum_{i=1}^n f(λ_i(\hat C_1^{-1}\hat C_2))$ (with $\hat C_a=\frac1{n_a}\sum_{i=1}^{n_a}x_i^{(a)}x_i^{(a){\sf T}}$), and this even for very small values of $n_1,n_2,p$.
SIJun 16, 2018
Latent heterogeneous multilayer community detectionHafiz Tiomoko Ali, Sijia Liu, Yasin Yilmaz et al.
We propose a method for simultaneously detecting shared and unshared communities in heterogeneous multilayer weighted and undirected networks. The multilayer network is assumed to follow a generative probabilistic model that takes into account the similarities and dissimilarities between the communities. We make use of a variational Bayes approach for jointly inferring the shared and unshared hidden communities from multilayer network observations. We show that our approach outperforms state-of-the-art algorithms in detecting disparate (shared and private) communities on synthetic data as well as on real genome-wide fibroblast proliferation dataset.
MLMay 30, 2018
The Dynamics of Learning: A Random Matrix ApproachZhenyu Liao, Romain Couillet
Understanding the learning dynamics of neural networks is one of the key issues for the improvement of optimization algorithms as well as for the theoretical comprehension of why deep neural nets work so well today. In this paper, we introduce a random matrix-based framework to analyze the learning dynamics of a single-layer linear network on a binary classification problem, for data of simultaneously large dimension and size, trained by gradient descent. Our results provide rich insights into common questions in neural nets, such as overfitting, early stopping and the initialization of training, thereby opening the door for future studies of more elaborate structures and models appearing in today's neural networks.
MLMay 30, 2018
On the Spectrum of Random Features Maps of High Dimensional DataZhenyu Liao, Romain Couillet
Random feature maps are ubiquitous in modern statistical machine learning, where they generalize random projections by means of powerful, yet often difficult to analyze nonlinear operators. In this paper, we leverage the "concentration" phenomenon induced by random matrix theory to perform a spectral analysis on the Gram matrix of these random feature maps, here for Gaussian mixture models of simultaneously large dimension and size. Our results are instrumental to a deeper understanding on the interplay of the nonlinearity and the statistics of the data, thereby allowing for a better tuning of random feature-based techniques.
LGNov 9, 2017
A random matrix analysis and improvement of semi-supervised learning for large dimensional dataXiaoyi Mai, Romain Couillet
This article provides an original understanding of the behavior of a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data. It is demonstrated that the intuition at the root of these methods collapses in this limit and that, as a result, most of them become inconsistent. Corrective measures and a new data-driven parametrization scheme are proposed along with a theoretical analysis of the asymptotic performances of the resulting approach. A surprisingly close behavior between theoretical performances on Gaussian mixture models and on real datasets is also illustrated throughout the article, thereby suggesting the importance of the proposed analysis for dealing with practical data. As a result, significant performance gains are observed on practical data classification using the proposed parametrization.
MLNov 1, 2017
A Large Dimensional Study of Regularized Discriminant Analysis ClassifiersKhalil Elkhalil, Abla Kammoun, Romain Couillet et al.
This article carries out a large dimensional analysis of standard regularized discriminant analysis classifiers designed on the assumption that data arise from a Gaussian mixture model with different means and covariances. The analysis relies on fundamental results from random matrix theory (RMT) when both the number of features and the cardinality of the training data within each class grow large at the same pace. Under mild assumptions, we show that the asymptotic classification error approaches a deterministic quantity that depends only on the means and covariances associated with each class as well as the problem dimensions. Such a result permits a better understanding of the performance of regularized discriminant analsysis, in practical large but finite dimensions, and can be used to determine and pre-estimate the optimal regularization parameter that minimizes the misclassification error probability. Despite being theoretically valid only for Gaussian data, our findings are shown to yield a high accuracy in predicting the performances achieved with real data sets drawn from the popular USPS data base, thereby making an interesting connection between theory and practice.
PRFeb 17, 2017
A Random Matrix Approach to Neural NetworksCosme Louart, Zhenyu Liao, Romain Couillet
This article studies the Gram random matrix model $G=\frac1TΣ^{\rm T}Σ$, $Σ=σ(WX)$, classically found in the analysis of random feature maps and random neural networks, where $X=[x_1,\ldots,x_T]\in{\mathbb R}^{p\times T}$ is a (data) matrix of bounded norm, $W\in{\mathbb R}^{n\times p}$ is a matrix of independent zero-mean unit variance entries, and $σ:{\mathbb R}\to{\mathbb R}$ is a Lipschitz continuous (activation) function --- $σ(WX)$ being understood entry-wise. By means of a key concentration of measure lemma arising from non-asymptotic random matrix arguments, we prove that, as $n,p,T$ grow large at the same rate, the resolvent $Q=(G+γI_T)^{-1}$, for $γ>0$, has a similar behavior as that met in sample covariance matrix models, involving notably the moment $Φ=\frac{T}n{\mathbb E}[G]$, which provides in passing a deterministic equivalent for the empirical spectral measure of $G$. Application-wise, this result enables the estimation of the asymptotic performance of single-layer random neural networks. This in turn provides practical insights into the underlying mechanisms into play in random neural networks, entailing several unexpected consequences, as well as a fast practical means to tune the network hyperparameters.
MLJan 11, 2017
A Large Dimensional Analysis of Least Squares Support Vector MachinesZhenyu Liao, Romain Couillet
In this article, a large dimensional performance analysis of kernel least squares support vector machines (LS-SVMs) is provided under the assumption of a two-class Gaussian mixture model for the input data. Building upon recent advances in random matrix theory, we show, when the dimension of data $p$ and their number $n$ are both large, that the LS-SVM decision function can be well approximated by a normally distributed random variable, the mean and variance of which depend explicitly on a local behavior of the kernel function. This theoretical result is then applied to the MNIST and Fashion-MNIST datasets which, despite their non-Gaussianity, exhibit a convincingly close behavior. Most importantly, our analysis provides a deeper understanding of the mechanism into play in SVM-type methods and in particular of the impact on the choice of the kernel function as well as some of their theoretical limits in separating high dimensional Gaussian vectors.
MLNov 3, 2016
Spectral community detection in heterogeneous large networksHafiz Tiomoko Ali, Romain Couillet
In this article, we study spectral methods for community detection based on $ α$-parametrized normalized modularity matrix hereafter called $ {\bf L}_α$ in heterogeneous graph models. We show, in a regime where community detection is not asymptotically trivial, that $ {\bf L}_α$ can be well approximated by a more tractable random matrix which falls in the family of spiked random matrices. The analysis of this equivalent spiked random matrix allows us to improve spectral methods for community detection and assess their performances in the regime under study. In particular, we prove the existence of an optimal value $ α_{\rm opt} $ of the parameter $ α$ for which the detection of communities is best ensured and we provide an on-line estimation of $ α_{\rm opt} $ only based on the knowledge of the graph adjacency matrix. Unlike classical spectral methods for community detection where clustering is performed on the eigenvectors associated with extreme eigenvalues, we show through our theoretical analysis that a regularization should instead be performed on those eigenvectors prior to clustering in heterogeneous graphs. Finally, through a deeper study of the regularized eigenvectors used for clustering, we assess the performances of our new algorithm for community detection. Numerical simulations in the course of the article show that our methods outperform state-of-the-art spectral methods on dense heterogeneous graphs.
MLSep 7, 2016
Random matrices meet machine learning: a large dimensional analysis of LS-SVMZhenyu Liao, Romain Couillet
This article proposes a performance analysis of kernel least squares support vector machines (LS-SVMs) based on a random matrix approach, in the regime where both the dimension of data $p$ and their number $n$ grow large at the same rate. Under a two-class Gaussian mixture model for the input data, we prove that the LS-SVM decision function is asymptotically normal with means and covariances shown to depend explicitly on the derivatives of the kernel function. This provides improved understanding along with new insights into the internal workings of SVM-type methods for large datasets.
LGMar 25, 2016
The Asymptotic Performance of Linear Echo State Neural NetworksRomain Couillet, Gilles Wainrib, Harry Sevi et al.
In this article, a study of the mean-square error (MSE) performance of linear echo-state neural networks is performed, both for training and testing tasks. Considering the realistic setting of noise present at the network nodes, we derive deterministic equivalents for the aforementioned MSE in the limit where the number of input data $T$ and network size $n$ both grow large. Specializing then the network connectivity matrix to specific random settings, we further obtain simple formulas that provide new insights on the performance of such networks.
STMar 4, 2015
Large Dimensional Analysis of Robust M-Estimators of Covariance with OutliersDavid Morales-Jimenez, Romain Couillet, Matthew R. McKay
A large dimensional characterization of robust M-estimators of covariance (or scatter) is provided under the assumption that the dataset comprises independent (essentially Gaussian) legitimate samples as well as arbitrary deterministic samples, referred to as outliers. Building upon recent random matrix advances in the area of robust statistics, we specifically show that the so-called Maronna M-estimator of scatter asymptotically behaves similar to well-known random matrices when the population and sample sizes grow together to infinity. The introduction of outliers leads the robust estimator to behave asymptotically as the weighted sum of the sample outer products, with a constant weight for all legitimate samples and different weights for the outliers. A fine analysis of this structure reveals importantly that the propensity of the M-estimator to attenuate (or enhance) the impact of outliers is mostly dictated by the alignment of the outliers with the inverse population covariance matrix of the legitimate samples. Thus, robust M-estimators can bring substantial benefits over more simplistic estimators such as the per-sample normalized version of the sample covariance matrix, which is not capable of differentiating the outlying samples. The analysis shows that, within the class of Maronna's estimators of scatter, the Huber estimator is most favorable for rejecting outliers. On the contrary, estimators more similar to Tyler's scale invariant estimator (often preferred in the literature) run the risk of inadvertently enhancing some outliers.