Ilias Zadik

ST
h-index17
19papers
481citations
Novelty64%
AI Score48

19 Papers

MLJun 15, 2022
Statistical and Computational Phase Transitions in Group Testing

Amin Coja-Oghlan, Oliver Gebhard, Max Hahn-Klimroth et al.

We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design. Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp "all-or-nothing" phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020). For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.

96.3STMar 23
Stable Algorithms Lower Bounds for Estimation

Xifan Yu, Ilias Zadik

In this work, we show that for all statistical estimation problems, a natural MMSE instability (discontinuity) condition implies the failure of stable algorithms, serving as a version of OGP for estimation tasks. Using this criterion, we establish separations between stable and polynomial-time algorithms for the following MMSE-unstable tasks (i) Planted Shortest Path, where Dijkstra's algorithm succeeds, (ii) random Parity Codes, where Gaussian elimination succeeds, and (iii) Gaussian Subset Sum, where lattice-based methods succeed. For all three, we further show that all low-degree polynomials are stable, yielding separations against low-degree methods and a new method to bound the low-degree MMSE. In particular, our technique highlights that MMSE instability is a common feature for Shortest Path and the noiseless Parity Codes and Gaussian subset sum. Last, we highlight that our work places rigorous algorithmic footing on the long-standing physics belief that first-order phase transitions--which in this setting translates to MMSE-instability impose fundamental limits on classes of efficient algorithms.

87.7STMar 20
The monotonicity of the Franz-Parisi potential is equivalent with Low-degree MMSE lower bounds

Konstantinos Tsirkas, Leda Wang, Ilias Zadik

Over the last decades, two distinct approaches have been instrumental to our understanding of the computational complexity of statistical estimation. The statistical physics literature predicts algorithmic hardness through local stability and monotonicity properties of the Franz--Parisi (FP) potential \cite{franz1995recipes,franz1997phase}, while the mathematically rigorous literature characterizes hardness via the limitations of restricted algorithmic classes, most notably low-degree polynomial estimators \cite{hopkins2017efficient}. For many inference models, these two perspectives yield strikingly consistent predictions, giving rise to a long-standing open problem of establishing a precise mathematical relationship between them. In this work, we show that for estimation problems the power of low-degree polynomials is equivalent to the monotonicity of the annealed FP potential for a broad family of Gaussian additive models (GAMs) with signal-to-noise ratio $λ$. In particular, subject to a low-degree conjecture for GAMs, our results imply that the polynomial-time limits of these models are directly implied by the monotonicity of the annealed FP potential, in conceptual agreement with predictions from the physics literature dating back to the 1990s.

LGMar 18, 2024
Transfer Learning Beyond Bounded Density Ratios

Alkis Kalavasis, Ilias Zadik, Manolis Zampetakis

We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.

LGDec 7, 2021
Lattice-Based Methods Surpass Sum-of-Squares in Clustering

Ilias Zadik, Min Jae Song, Alexander S. Wein et al.

Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.

LGJun 20, 2021
On the Cryptographic Hardness of Learning Single Periodic Neurons

Min Jae Song, Ilias Zadik, Joan Bruna

We show a simple reduction which demonstrates the cryptographic hardness of learning a single periodic neuron over isotropic Gaussian distributions in the presence of noise. More precisely, our reduction shows that any polynomial-time algorithm (not necessarily gradient-based) for learning such functions under small noise implies a polynomial-time quantum algorithm for solving worst-case lattice problems, whose hardness form the foundation of lattice-based cryptography. Our core hard family of functions, which are well-approximated by one-layer neural networks, take the general form of a univariate periodic function applied to an affine projection of the data. These functions have appeared in previous seminal works which demonstrate their hardness against gradient-based (Shamir'18), and Statistical Query (SQ) algorithms (Song et al.'17). We show that if (polynomially) small noise is added to the labels, the intractability of learning these functions applies to all polynomial-time algorithms, beyond gradient-based and SQ algorithms, under the aforementioned cryptographic assumptions. Moreover, we demonstrate the necessity of noise in the hardness result by designing a polynomial-time algorithm for learning certain families of such functions under exponentially small adversarial noise. Our proposed algorithm is not a gradient-based or an SQ algorithm, but is rather based on the celebrated Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. Furthermore, in the absence of noise, this algorithm can be directly applied to solve CLWE detection (Bruna et al.'21) and phase retrieval with an optimal sample complexity of $d+1$ samples. In the former case, this improves upon the quadratic-in-$d$ sample complexity required in (Bruna et al.'21).

MLMar 2, 2021
Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible as quantified by the training error; and study the following question: \emph{does a low training error guarantee that the norm of the output layer (outer norm) itself is small?} We answer affirmatively this question for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs; any such network achieving a small training error on polynomially many data necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in $d$) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm; and (d) require quite mild assumptions on the data (in particular the input vector $X\in\mathbb{R}^d$ need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity class that the network architectures we investigate belong to. Notably, our generalization bounds also have good sample complexity (polynomials in $d$ with a low degree), and are in fact near-linear for some important cases of interest.

STNov 12, 2020
Optimal Private Median Estimation under Minimal Distributional Assumptions

Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Ilias Zadik

We study the fundamental task of estimating the median of an underlying distribution from a finite number of samples, under pure differential privacy constraints. We focus on distributions satisfying the minimal assumption that they have a positive density at a small neighborhood around the median. In particular, the distribution is allowed to output unbounded values and is not required to have finite moments. We compute the exact, up-to-constant terms, statistical rate of estimation for the median by providing nearly-tight upper and lower bounds. Furthermore, we design a polynomial-time differentially private algorithm which provably achieves the optimal performance. At a technical level, our results leverage a Lipschitz Extension Lemma which allows us to design and analyze differentially private algorithms solely on appropriately defined "typical" instances of the samples.

PRJun 18, 2020
Free Energy Wells and Overlap Gap Property in Sparse PCA

Gérard Ben Arous, Alexander S. Wein, Ilias Zadik

We study a variant of the sparse PCA (principal component analysis) problem in the "hard" regime, where the inference task is possible yet no polynomial-time algorithm is known to exist. Prior work, based on the low-degree likelihood ratio, has conjectured a precise expression for the best possible (sub-exponential) runtime throughout the hard regime. Following instead a statistical physics inspired point of view, we show bounds on the depth of free energy wells for various Gibbs measures naturally associated to the problem. These free energy wells imply hitting time lower bounds that corroborate the low-degree conjecture: we show that a class of natural MCMC (Markov chain Monte Carlo) methods (with worst-case initialization) cannot solve sparse PCA with less than the conjectured runtime. These lower bounds apply to a wide range of values for two tuning parameters: temperature and sparsity misparametrization. Finally, we prove that the Overlap Gap Property (OGP), a structural property that implies failure of certain local search algorithms, holds in a significant part of the hard regime.

MLMar 23, 2020
Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena

Matt Emschwiller, David Gamarnik, Eren C. Kızıldağ et al.

In the context of neural network models, overparametrization refers to the phenomena whereby these models appear to generalize well on the unseen data, even though the number of parameters significantly exceeds the sample sizes, and the model perfectly fits the in-training data. A conventional explanation of this phenomena is based on self-regularization properties of algorithms used to train the data. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model where the teacher network is used to generate the predictions and student network is trained on the observed labeled data, and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension and approximation guarantee alone, regardless of the number of internal nodes of either teacher or student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending on the desired accuracy and network depth only. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves a good out-of-sample performance, even when the degree of the tensor is at most two.

MLDec 3, 2019
Stationary Points of Shallow Neural Networks with Quadratic Activation Function

David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

We consider the teacher-student setting of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\le m$ is the data dimension. We study the optimization landscape associated with the empirical and the population squared risk of the problem. Under the assumption the planted weights are full-rank we obtain the following results. First, we establish that the landscape of the empirical risk admits an "energy barrier" separating rank-deficient $W$ from $W^*$: if $W$ is rank deficient, then its risk is bounded away from zero by an amount we quantify. We then couple this result by showing that, assuming number $N$ of samples grows at least like a polynomial function of $d$, all full-rank approximate stationary points of the empirical risk are nearly global optimum. These two results allow us to prove that gradient descent, when initialized below the energy barrier, approximately minimizes the empirical risk and recovers the planted weights in polynomial-time. Next, we show that initializing below this barrier is in fact easily achieved when the weights are randomly generated under relatively weak assumptions. We show that provided the network is sufficiently overparametrized, initializing with an appropriate multiple of the identity suffices to obtain a risk below the energy barrier. At a technical level, the last result is a consequence of the semicircle law for the Wishart ensemble and could be of independent interest. Finally, we study the minimizers of the empirical risk and identify a simple necessary and sufficient geometric condition on the training data under which any minimizer has necessarily zero generalization error. We show that as soon as $N\ge N^*=d(d+1)/2$, randomly generated data enjoys this geometric condition almost surely, while that ceases to be true if $N<N^*$.

STOct 24, 2019
Inference in High-Dimensional Linear Regression via Lattice Basis Reduction and Integer Relation Detection

David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

We focus on the high-dimensional linear regression problem, where the algorithmic goal is to efficiently infer an unknown feature vector $β^*\in\mathbb{R}^p$ from its linear measurements, using a small number $n$ of samples. Unlike most of the literature, we make no sparsity assumption on $β^*$, but instead adopt a different regularization: In the noiseless setting, we assume $β^*$ consists of entries, which are either rational numbers with a common denominator $Q\in\mathbb{Z}^+$ (referred to as $Q$-rationality); or irrational numbers supported on a rationally independent set of bounded cardinality, known to learner; collectively called as the mixed-support assumption. Using a novel combination of the PSLQ integer relation detection, and LLL lattice basis reduction algorithms, we propose a polynomial-time algorithm which provably recovers a $β^*\in\mathbb{R}^p$ enjoying the mixed-support assumption, from its linear measurements $Y=Xβ^*\in\mathbb{R}^n$ for a large class of distributions for the random entries of $X$, even with one measurement $(n=1)$. In the noisy setting, we propose a polynomial-time, lattice-based algorithm, which recovers a $β^*\in\mathbb{R}^p$ enjoying $Q$-rationality, from its noisy measurements $Y=Xβ^*+W\in\mathbb{R}^n$, even with a single sample $(n=1)$. We further establish for large $Q$, and normal noise, this algorithm tolerates information-theoretically optimal level of noise. We then apply these ideas to develop a polynomial-time, single-sample algorithm for the phase retrieval problem. Our methods address the single-sample $(n=1)$ regime, where the sparsity-based methods such as LASSO and Basis Pursuit are known to fail. Furthermore, our results also reveal an algorithmic connection between the high-dimensional linear regression problem, and the integer relation detection, randomized subset-sum, and shortest vector problems.

STApr 15, 2019
The Landscape of the Planted Clique Problem: Dense subgraphs and the Overlap Gap Property

David Gamarnik, Ilias Zadik

In this paper we study the computational-statistical gap of the planted clique problem, where a clique of size $k$ is planted in an Erdos Renyi graph $G(n,\frac{1}{2})$ resulting in a graph $G\left(n,\frac{1}{2},k\right)$. The goal is to recover the planted clique vertices by observing $G\left(n,\frac{1}{2},k\right)$ . It is known that the clique can be recovered as long as $k \geq \left(2+ε\right)\log n $ for any $ε>0$, but no polynomial-time algorithm is known for this task unless $k=Ω\left(\sqrt{n} \right)$. Following a statistical-physics inspired point of view as an attempt to understand this computational-statistical gap, we study the landscape of the "sufficiently dense" subgraphs of $G$ and their overlap with the planted clique. Using the first moment method, we study the densest subgraph problems for subgraphs with fixed, but arbitrary, overlap size with the planted clique, and provide evidence of a phase transition for the presence of Overlap Gap Property (OGP) at $k=Θ\left(\sqrt{n}\right)$. OGP is a concept introduced originally in spin glass theory and known to suggest algorithmic hardness when it appears. We establish the presence of OGP when $k$ is a small positive power of $n$ by using a conditional second moment method. As our main technical tool, we establish the first, to the best of our knowledge, concentration results for the $K$-densest subgraph problem for the Erdos-Renyi model $G\left(n,\frac{1}{2}\right)$ when $K=n^{0.5-ε}$ for arbitrary $ε>0$. Finally, to study the OGP we employ a certain form of overparametrization, which is conceptually aligned with a large body of recent work in learning theory and optimization.

STOct 30, 2018
Private Algorithms Can Always Be Extended

Christian Borgs, Jennifer Chayes, Adam Smith et al.

We consider the following fundamental question on $ε$-differential privacy. Consider an arbitrary $ε$-differentially private algorithm defined on a subset of the input space. Is it possible to extend it to an $ε'$-differentially private algorithm on the whole input space for some $ε'$ comparable with $ε$? In this note we answer affirmatively this question for $ε'=2ε$. Our result applies to every input metric space and space of possible outputs. This result originally appeared in a recent paper by the authors [BCSZ18]. We present a self-contained version in this note, in the hopes that it will be broadly useful.

STOct 4, 2018
Revealing Network Structure, Confidentially: Improved Rates for Node-Private Graphon Estimation

Christian Borgs, Jennifer Chayes, Adam Smith et al.

Motivated by growing concerns over ensuring privacy on social networks, we develop new algorithms and impossibility results for fitting complex statistical models to network data subject to rigorous privacy guarantees. We consider the so-called node-differentially private algorithms, which compute information about a graph or network while provably revealing almost no information about the presence or absence of a particular node in the graph. We provide new algorithms for node-differentially private estimation for a popular and expressive family of network models: stochastic block models and their generalization, graphons. Our algorithms improve on prior work, reducing their error quadratically and matching, in many regimes, the optimal nonprivate algorithm. We also show that for the simplest random graph models ($G(n,p)$ and $G(n,m)$), node-private algorithms can be qualitatively more accurate than for more complex models---converging at a rate of $\frac{1}{ε^2 n^{3}}$ instead of $\frac{1}{ε^2 n^2}$. This result uses a new extension lemma for differentially private algorithms that we hope will be broadly useful.

STMar 18, 2018
High Dimensional Linear Regression using Lattice Basis Reduction

David Gamarnik, Ilias Zadik

We consider a high dimensional linear regression problem where the goal is to efficiently recover an unknown vector $β^*$ from $n$ noisy linear observations $Y=Xβ^*+W \in \mathbb{R}^n$, for known $X \in \mathbb{R}^{n \times p}$ and unknown $W \in \mathbb{R}^n$. Unlike most of the literature on this model we make no sparsity assumption on $β^*$. Instead we adopt a regularization based on assuming that the underlying vectors $β^*$ have rational entries with the same denominator $Q \in \mathbb{Z}_{>0}$. We call this $Q$-rationality assumption. We propose a new polynomial-time algorithm for this task which is based on the seminal Lenstra-Lenstra-Lovasz (LLL) lattice basis reduction algorithm. We establish that under the $Q$-rationality assumption, our algorithm recovers exactly the vector $β^*$ for a large class of distributions for the iid entries of $X$ and non-zero noise $W$. We prove that it is successful under small noise, even when the learner has access to only one observation ($n=1$). Furthermore, we prove that in the case of the Gaussian white noise for $W$, $n=o\left(p/\log p\right)$ and $Q$ sufficiently large, our algorithm tolerates a nearly optimal information-theoretic level of the noise.

STNov 14, 2017
Sparse High-Dimensional Linear Regression. Algorithmic Barriers and a Local Search Algorithm

David Gamarnik, Ilias Zadik

We consider a sparse high dimensional regression model where the goal is to recover a $k$-sparse unknown vector $β^*$ from $n$ noisy linear observations of the form $Y=Xβ^*+W \in \mathbb{R}^n$ where $X \in \mathbb{R}^{n \times p}$ has iid $N(0,1)$ entries and $W \in \mathbb{R}^n$ has iid $N(0,σ^2)$ entries. Under certain assumptions on the parameters, an intriguing assymptotic gap appears between the minimum value of $n$, call it $n^*$, for which the recovery is information theoretically possible, and the minimum value of $n$, call it $n_{\mathrm{alg}}$, for which an efficient algorithm is known to provably recover $β^*$. In \cite{gamarnikzadik} it was conjectured that the gap is not artificial, in the sense that for sample sizes $n \in [n^*,n_{\mathrm{alg}}]$ the problem is algorithmically hard. We support this conjecture in two ways. Firstly, we show that the optimal solution of the LASSO provably fails to $\ell_2$-stably recover the unknown vector $β^*$ when $n \in [n^*,c n_{\mathrm{alg}}]$, for some sufficiently small constant $c>0$. Secondly, we establish that $n_{\mathrm{alg}}$, up to a multiplicative constant factor, is a phase transition point for the appearance of a certain Overlap Gap Property (OGP) over the space of $k$-sparse vectors. The presence of such an Overlap Gap Property phase transition, which originates in statistical physics, is known to provide evidence of an algorithmic hardness. Finally we show that if $n>C n_{\mathrm{alg}}$ for some large enough constant $C>0$, a very simple algorithm based on a local search improvement rule is able both to $\ell_2$-stably recover the unknown vector $β^*$ and to infer correctly its support, adding it to the list of provably successful algorithms for the high dimensional linear regression problem.

LGNov 1, 2017
Orthogonal Machine Learning: Power and Limitations

Lester Mackey, Vasilis Syrgkanis, Ilias Zadik

Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a $k$-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.

MLJan 16, 2017
High-Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transition

David Gamarnik, Ilias Zadik

We consider a sparse linear regression model Y=Xβ^{*}+W where X has a Gaussian entries, W is the noise vector with mean zero Gaussian entries, and β^{*} is a binary vector with support size (sparsity) k. Using a novel conditional second moment method we obtain a tight up to a multiplicative constant approximation of the optimal squared error \min_β\|Y-Xβ\|_{2}, where the minimization is over all k-sparse binary vectors β. The approximation reveals interesting structural properties of the underlying regression problem. In particular, a) We establish that n^*=2k\log p/\log (2k/σ^{2}+1) is a phase transition point with the following "all-or-nothing" property. When n exceeds n^{*}, (2k)^{-1}\|β_{2}-β^*\|_0\approx 0, and when n is below n^{*}, (2k)^{-1}\|β_{2}-β^*\|_0\approx 1, where β_2 is the optimal solution achieving the smallest squared error. With this we prove that n^{*} is the asymptotic threshold for recovering β^* information theoretically. b) We compute the squared error for an intermediate problem \min_β\|Y-Xβ\|_{2} where minimization is restricted to vectors βwith \|β-β^{*}\|_0=2k ζ, for ζ\in [0,1]. We show that a lower bound part Γ(ζ) of the estimate, which corresponds to the estimate based on the first moment method, undergoes a phase transition at three different thresholds, namely n_{\text{inf,1}}=σ^2\log p, which is information theoretic bound for recovering β^* when k=1 and σis large, then at n^{*} and finally at n_{\text{LASSO/CS}}. c) We establish a certain Overlap Gap Property (OGP) on the space of all binary vectors βwhen n\le ck\log p for sufficiently small constant c. We conjecture that OGP is the source of algorithmic hardness of solving the minimization problem \min_β\|Y-Xβ\|_{2} in the regime n<n_{\text{LASSO/CS}}.