25.5 · DS · May 7
Algorithmic Phase Transition for Large Independent Sets in Dense Hypergraphs
Abhishek Dhawan, Nhi U. Dinh, Eren C. Kızıldağ et al.
We study the algorithmic tractability of finding large independent sets in dense random hypergraphs. In the sparse regime, many of the natural algorithms can be formulated within either the local or the low-degree polynomial (LDP) framework, and a rich literature has subsequently identified nearly sharp algorithmic thresholds within these classes by exploiting their stability. In the dense setting, however, the algorithmic paradigms are fundamentally different: they are online and thus need not be stable. Perhaps more crucially, even for the classical Erdős-Rényi random graph $G(n,p)$, LDPs are conjectured to fail in the 'easy' regime accessible to online algorithms, thereby challenging their viability for dense models. Our focus is on two models: (i) finding large independent sets in dense $r$-uniform Erdős-Rényi hypergraphs, and (ii) the more challenging problem of finding large $γ$-balanced independent sets in dense $r$-uniform $r$-partite hypergraphs, where the $i$-th coordinate of $γ\in\mathbb{Q}^r$ specifies the proportion of vertices from $V_i$ in the independent set. For both models, we pinpoint the size of the largest independent set and design online algorithms that achieve a multiplicative approximation factor of $r^{1/(r-1)}$ in the uniform model and $(\max_i γ_i)^{-1/(r-1)}$ in the $r$-partite model. Furthermore, we establish matching algorithmic lower bounds, showing that these computational gaps are sharp: no online algorithm can breach them.
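To make the notion of an online algorithm concrete, here is a minimal Python sketch of a generic greedy rule for independent sets in an $r$-uniform hypergraph. It is an illustration only and is not claimed to be the algorithm analyzed in the paper; the function names `random_hypergraph`, `greedy_online_independent_set`, and `is_independent` are hypothetical. Vertices arrive in the order $0,1,\dots,n-1$, and each decision is irrevocable and depends only on hyperedges among vertices seen so far.

```python
from itertools import combinations
import random


def random_hypergraph(n, r, p, seed=0):
    """Sample an r-uniform Erdős-Rényi hypergraph: each r-subset is an edge w.p. p."""
    rng = random.Random(seed)
    return [frozenset(e) for e in combinations(range(n), r) if rng.random() < p]


def greedy_online_independent_set(n, edges):
    """Process vertices 0..n-1 in order; keep v unless it would complete a hyperedge."""
    # An edge is revealed when its last (largest) vertex arrives.
    by_last = {}
    for e in edges:
        by_last.setdefault(max(e), []).append(e)

    chosen = set()
    for v in range(n):
        # Reject v if some edge ending at v has all its other vertices already chosen.
        completes_edge = any((e - {v}) <= chosen for e in by_last.get(v, []))
        if not completes_edge:
            chosen.add(v)
    return chosen


def is_independent(chosen, edges):
    """A set is independent if it contains no hyperedge entirely."""
    return not any(e <= chosen for e in edges)


if __name__ == "__main__":
    edges = random_hypergraph(n=30, r=3, p=0.5, seed=1)
    found = greedy_online_independent_set(30, edges)
    print(len(found), is_independent(found, edges))
```

The set returned is always independent by construction; how its size compares with the optimum is exactly the kind of question the abstract addresses.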
ML · Mar 2, 2021
Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks
David Gamarnik, Eren C. Kızıldağ, Ilias Zadik
We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible, as quantified by the training error, and study the following question: \emph{does a low training error guarantee that the norm of the output layer (outer norm) itself is small?} We answer this question affirmatively for the case of non-negative output weights. Using a simple covering number argument, we establish that, under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data points necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in $d$) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm, and (d) require quite mild assumptions on the data (in particular, the input vector $X\in\mathbb{R}^d$ need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity of the class to which the network architectures we investigate belong. Notably, our generalization bounds also have good sample complexity (polynomials in $d$ with a low degree), and are in fact near-linear for some important cases of interest.
ML · Mar 23, 2020
Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena
Matt Emschwiller, David Gamarnik, Eren C. Kızıldağ et al.
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data, even though the number of parameters significantly exceeds the sample size and the model perfectly fits the training data. A conventional explanation of this phenomenon is based on self-regularization properties of the algorithms used for training. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model, where the teacher network is used to generate the predictions and the student network is trained on the observed labeled data and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by the data dimension and the approximation guarantee alone, regardless of the number of internal nodes of either the teacher or the student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending only on the desired accuracy and the network depth. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on the MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves good out-of-sample performance, even when the degree of the tensor is at most two.
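As a rough illustration of "the number of regression coefficients associated with tensorized data", the Python sketch below expands an input vector into all monomials of degree at most two and counts them. This is an assumption about what tensorization could look like, not the paper's exact construction (a full order-2 tensor would also carry the duplicate entries $x_i x_j$ and $x_j x_i$); the names `tensorized_features` and `num_coefficients` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def tensorized_features(x, degree=2):
    """Return all monomials of total degree <= `degree` in the entries of x."""
    feats = []
    for k in range(degree + 1):
        # Each multiset of k indices gives one distinct monomial of degree k.
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            feats.append(prod)
    return feats


def num_coefficients(d, degree=2):
    """Number of monomials of degree <= `degree` in d variables: C(d + degree, degree)."""
    return comb(d + degree, degree)


if __name__ == "__main__":
    print(tensorized_features([2.0, 3.0]))   # constant, linear, then quadratic terms
    print(num_coefficients(784))             # 784 = 28 * 28 pixels of an MNIST image
```

The count depends only on the input dimension and the degree, not on any network width, which is the point the abstract makes about measuring parametrization complexity.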
ML · Dec 3, 2019
Stationary Points of Shallow Neural Networks with Quadratic Activation Function
David Gamarnik, Eren C. Kızıldağ, Ilias Zadik
We consider the teacher-student setting of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\le m$ is the data dimension. We study the optimization landscape associated with the empirical and the population squared risk of the problem. Under the assumption that the planted weights are full-rank, we obtain the following results. First, we establish that the landscape of the empirical risk admits an "energy barrier" separating rank-deficient $W$ from $W^*$: if $W$ is rank-deficient, then its risk is bounded away from zero by an amount we quantify. We then complement this result by showing that, assuming the number $N$ of samples grows at least like a polynomial function of $d$, all full-rank approximate stationary points of the empirical risk are nearly globally optimal. These two results allow us to prove that gradient descent, when initialized below the energy barrier, approximately minimizes the empirical risk and recovers the planted weights in polynomial time. Next, we show that initializing below this barrier is in fact easily achieved when the weights are randomly generated under relatively weak assumptions. We show that, provided the network is sufficiently overparametrized, initializing with an appropriate multiple of the identity suffices to obtain a risk below the energy barrier. At a technical level, the last result is a consequence of the semicircle law for the Wishart ensemble and could be of independent interest. Finally, we study the minimizers of the empirical risk and identify a simple necessary and sufficient geometric condition on the training data under which any minimizer necessarily has zero generalization error. We show that as soon as $N\ge N^*=d(d+1)/2$, randomly generated data enjoys this geometric condition almost surely, while this ceases to be true if $N<N^*$.
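One way to see why a threshold of the form $d(d+1)/2$ is natural, assuming the network output is $\sum_j \langle w_j, x\rangle^2$ with unit output weights and rows $w_j$ of $W$ (an assumption, since the abstract does not spell out the architecture):

```latex
\[
f(W;x) \;=\; \sum_{j=1}^{m} \langle w_j, x\rangle^2
\;=\; x^\top W^\top W x
\;=\; \langle W^\top W,\; x x^\top \rangle,
\qquad
\dim\{M\in\mathbb{R}^{d\times d} : M = M^\top\} \;=\; \frac{d(d+1)}{2}.
\]
```

Under that assumption the output depends on $W$ only through the symmetric matrix $W^\top W$ and is linear in it, so each sample contributes one linear constraint on an object with $d(d+1)/2$ degrees of freedom. This is only a heuristic count consistent with the stated $N^*$; the precise geometric condition is the one established in the paper.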
ST · Oct 24, 2019
Inference in High-Dimensional Linear Regression via Lattice Basis Reduction and Integer Relation Detection
David Gamarnik, Eren C. Kızıldağ, Ilias Zadik
We focus on the high-dimensional linear regression problem, where the algorithmic goal is to efficiently infer an unknown feature vector $β^*\in\mathbb{R}^p$ from its linear measurements, using a small number $n$ of samples. Unlike most of the literature, we make no sparsity assumption on $β^*$, but instead adopt a different regularization: in the noiseless setting, we assume $β^*$ consists of entries that are either rational numbers with a common denominator $Q\in\mathbb{Z}^+$ (referred to as $Q$-rationality), or irrational numbers supported on a rationally independent set of bounded cardinality known to the learner; together these are called the mixed-support assumption. Using a novel combination of the PSLQ integer relation detection and LLL lattice basis reduction algorithms, we propose a polynomial-time algorithm which provably recovers a $β^*\in\mathbb{R}^p$ enjoying the mixed-support assumption from its linear measurements $Y=Xβ^*\in\mathbb{R}^n$, for a large class of distributions for the random entries of $X$, even with one measurement $(n=1)$. In the noisy setting, we propose a polynomial-time, lattice-based algorithm which recovers a $β^*\in\mathbb{R}^p$ enjoying $Q$-rationality from its noisy measurements $Y=Xβ^*+W\in\mathbb{R}^n$, even with a single sample $(n=1)$. We further establish that, for large $Q$ and normal noise, this algorithm tolerates an information-theoretically optimal level of noise. We then apply these ideas to develop a polynomial-time, single-sample algorithm for the phase retrieval problem. Our methods address the single-sample $(n=1)$ regime, where sparsity-based methods such as LASSO and Basis Pursuit are known to fail. Furthermore, our results also reveal an algorithmic connection between the high-dimensional linear regression problem and the integer relation detection, randomized subset-sum, and shortest vector problems.
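The Python toy example below hints at why a single noiseless measurement can pin down a $Q$-rational vector: multiplying $y=\langle x, β^*\rangle$ by $Q$ turns recovery into finding an integer relation $Qy=\sum_i k_i x_i$. The sketch swaps in an exponential-time brute-force search for the PSLQ and LLL machinery the abstract describes, so it illustrates identifiability only, not the paper's polynomial-time algorithm. The function name `recover_by_brute_force`, the bound `max_num` on the numerators, and the choice of large random integer entries for $x$ are all assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product
import random


def recover_by_brute_force(x, y, Q, max_num):
    """Return every beta with entries k/Q, 0 <= k <= max_num, such that <x, beta> == y."""
    target = y * Q  # Q*y = sum_i k_i * x_i is an integer relation among the x_i
    hits = []
    for ks in product(range(max_num + 1), repeat=len(x)):
        if sum(k * xi for k, xi in zip(ks, x)) == target:
            hits.append([Fraction(k, Q) for k in ks])
    return hits


# A single measurement (n = 1) with large random integer entries.
rng = random.Random(0)
Q, max_num, p = 7, 6, 3
x = [rng.randrange(1, 10**12) for _ in range(p)]
beta_true = [Fraction(3, 7), Fraction(5, 7), Fraction(1, 7)]
y = sum(xi * b for xi, b in zip(x, beta_true))  # exact rational arithmetic

solutions = recover_by_brute_force(x, y, Q, max_num)
print(beta_true in solutions, len(solutions))
```

The true vector is always among the solutions; when the entries of $x$ are drawn from a range much larger than the number of candidates, a second candidate matching the same measurement is very unlikely.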