LGMay 22
Representation Alignment Rests on Linear StructureKiril Bangachev, Guy Bresler, Yury Polyanskiy
We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. {1) Signal:} We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. {2) Bias:} Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated. Centering and normalization consistently improve cross-model alignment. {3) Noise:} Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.
LGMay 22
Is Dimensionality a Barrier for Retrieval Models?Kiril Bangachev, Guy Bresler, Jonathan Kogan et al.
Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m>0,$ denoted by $\mathsf{m}^{\mathsf{rd}}(d, A),$ for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ with the following property. $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A),$ can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$ which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = Θ(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k)).$ Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
CCApr 2
Average-Case Reductions for $k$-XOR and Tensor PCAGuy Bresler, Alina Harbuzova
We study the computational properties of two canonical planted average-case problems -- noisy planted $k$-XOR and Tensor PCA -- by formally unifying them into a family of planted problems parametrized by tensor order $k$, number of entries $m$, and noise level $δ$. We build a wide range of poly-time average-case reductions within this family, across all regimes $m \in [1, n^k]$. In the denser $m \geq n^{k/2}$ regime, our reductions preserve proximity to the computational threshold, and, as a central application, reduce conjectured-hard $k$-XOR instances with $m \approx n^{k/2}$ to conjectured-hard instances of Tensor PCA. Additionally, we give new order-reducing maps at fixed densities (e.g., $5\to 4$ for $k$-XOR with $m \approx n^{k/2}$ entries and $7\to 4$ for Tensor PCA). In the sparser $m \leq n^{k/2}$ regime, we relate instances of different orders, reducing, for example, $7$-XOR with $m = n^{3.4}$ to the classical setting of $3$-XOR with $m = \widetildeÎ(n^{1.4})$. Taken together, these results establish a hardness partial order in the space of planted tensor models.
LGSep 23, 2025
Global Minimizers of Sigmoid Contrastive LossKiril Bangachev, Guy Bresler, Iliyas Noman et al.
The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
MLApr 28, 2025
Optimal Sequential Recommendations: Exploiting User and Item StructureMina Karzand, Guy Bresler
We consider an online model for recommendation systems, with each user being recommended an item at each time-step and providing 'like' or 'dislike' feedback. A latent variable model specifies the user preferences: both users and items are clustered into types. The model captures structure in both the item and user spaces, as used by item-item and user-user collaborative filtering algorithms. We study the situation in which the type preference matrix has i.i.d. entries. Our main contribution is an algorithm that simultaneously uses both item and user structures, proved to be near-optimal via corresponding information-theoretic lower bounds. In particular, our analysis highlights the sub-optimality of using only one of item or user structure (as is done in most collaborative filtering algorithms).
LGAug 24, 2021
The staircase property: How hierarchical structure can guide deep learningEmmanuel Abbe, Enric Boix-Adsera, Matthew Brennan et al.
This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the fact that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any SQ or PAC algorithms as recently shown.
LGJun 7, 2021
Chow-Liu++: Optimal Prediction-Centric Learning of Tree Ising ModelsEnric Boix-Adsera, Guy Bresler, Frederic Koehler
We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been the benchmark algorithm for learning tree-structured graphical models. A bound on the sample complexity of the Chow-Liu algorithm with respect to the prediction-centric local total variation loss was shown in [BK19]. While those results demonstrated that it is possible to learn a useful model even when recovering the true underlying graph is impossible, their bound depends on the maximum strength of interactions and thus does not achieve the information-theoretic optimum. In this paper, we introduce a new algorithm that carefully combines elements of the Chow-Liu algorithm with tree metric reconstruction methods to efficiently and optimally learn tree Ising models under a prediction-centric loss. Our algorithm is robust to model misspecification and adversarial corruptions. In contrast, we show that the celebrated Chow-Liu algorithm can be arbitrarily suboptimal.
CCJun 3, 2021
The Algorithmic Phase Transition of Random $k$-SAT for Low Degree PolynomialsGuy Bresler, Brice Huang
Let $Φ$ be a uniformly random $k$-SAT formula with $n$ variables and $m$ clauses. We study the algorithmic task of finding a satisfying assignment of $Φ$. It is known that satisfying assignments exist with high probability up to clause density $m/n = 2^k \log 2 - \frac12 (\log 2 + 1) + o_k(1)$, while the best polynomial-time algorithm known, the Fix algorithm of Coja-Oghlan, finds a satisfying assignment at the much lower clause density $(1 - o_k(1)) 2^k \log k / k$. This prompts the question: is it possible to efficiently find a satisfying assignment at higher clause densities? We prove that the class of low degree polynomial algorithms cannot find a satisfying assignment at clause density $(1 + o_k(1)) κ^* 2^k \log k / k$ for a universal constant $κ^* \approx 4.911$. This class encompasses Fix, message passing algorithms including Belief and Survey Propagation guided decimation (with bounded or mildly growing number of rounds), and local algorithms on the factor graph. This is the first hardness result for any class of algorithms at clause density within a constant factor of that achieved by Fix. Our proof establishes and leverages a new many-way overlap gap property tailored to random $k$-SAT.
STMar 29, 2021
The EM Algorithm is Adaptively-Optimal for Unbalanced Symmetric Gaussian MixturesNir Weinberger, Guy Bresler
This paper studies the problem of estimating the means $\pmθ_{*}\in\mathbb{R}^{d}$ of a symmetric two-component Gaussian mixture $δ_{*}\cdot N(θ_{*},I)+(1-δ_{*})\cdot N(-θ_{*},I)$ where the weights $δ_{*}$ and $1-δ_{*}$ are unequal. Assuming that $δ_{*}$ is known, we show that the population version of the EM algorithm globally converges if the initial estimate has non-negative inner product with the mean of the larger weight component. This can be achieved by the trivial initialization $θ_{0}=0$. For the empirical iteration based on $n$ samples, we show that when initialized at $θ_{0}=0$, the EM algorithm adaptively achieves the minimax error rate $\tilde{O}\Big(\min\Big\{\frac{1}{(1-2δ_{*})}\sqrt{\frac{d}{n}},\frac{1}{\|θ_{*}\|}\sqrt{\frac{d}{n}},\left(\frac{d}{n}\right)^{1/4}\Big\}\Big)$ in no more than $O\Big(\frac{1}{\|θ_{*}\|(1-2δ_{*})}\Big)$ iterations (with high probability). We also consider the EM iteration for estimating the weight $δ_{*}$, assuming a fixed mean $θ$ (which is possibly mismatched to $θ_{*}$). For the empirical iteration of $n$ samples, we show that the minimax error rate $\tilde{O}\Big(\frac{1}{\|θ_{*}\|}\sqrt{\frac{d}{n}}\Big)$ is achieved in no more than $O\Big(\frac{1}{\|θ_{*}\|^{2}}\Big)$ iterations. These results robustify and complement recent results of Wu and Zhou obtained for the equal weights case $δ_{*}=1/2$.
CCSep 13, 2020
Statistical Query Algorithms and Low-Degree Tests Are Almost EquivalentMatthew Brennan, Guy Bresler, Samuel B. Hopkins et al.
Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.
LGJun 16, 2020
Least Squares Regression with Markovian Data: Fundamental Limits and AlgorithmsGuy Bresler, Prateek Jain, Dheeraj Nagaraj et al.
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $τ_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\tildeΘ(τ_{\mathsf{mix}})$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting. Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.
LGJun 7, 2020
Learning Restricted Boltzmann Machines with Sparse Latent VariablesGuy Bresler, Rares-Darius Buhai
Restricted Boltzmann Machines (RBMs) are a common family of undirected graphical models with latent variables. An RBM is described by a bipartite graph, with all observed variables in one layer and all latent variables in the other. We consider the task of learning an RBM given samples generated according to it. The best algorithms for this task currently have time complexity $\tilde{O}(n^2)$ for ferromagnetic RBMs (i.e., with attractive potentials) but $\tilde{O}(n^d)$ for general RBMs, where $n$ is the number of observed variables and $d$ is the maximum degree of a latent variable. Let the MRF neighborhood of an observed variable be its neighborhood in the Markov Random Field of the marginal distribution of the observed variables. In this paper, we give an algorithm for learning general RBMs with time complexity $\tilde{O}(n^{2^s+1})$, where $s$ is the maximum number of latent variables connected to the MRF neighborhood of an observed variable. This is an improvement when $s < \log_2 (d-1)$, which corresponds to RBMs with sparse latent variables. Furthermore, we give a version of this learning algorithm that recovers a model with small prediction error and whose sample complexity is independent of the minimum potential in the Markov Random Field of the observed variables. This is of interest because the sample complexity of current algorithms scales with the inverse of the minimum potential, which cannot be controlled in terms of natural properties of the RBM.
MLJun 7, 2020
Sharp Representation Theorems for ReLU Networks with Precise Dependence on DepthGuy Bresler, Dheeraj Nagaraj
We prove sharp dimension-free representation results for neural networks with $D$ ReLU layers under square loss for a class of functions $\mathcal{G}_D$ defined in the paper. These results capture the precise benefits of depth in the following sense: 1. The rates for representing the class of functions $\mathcal{G}_D$ via $D$ ReLU layers is sharp up to constants, as shown by matching lower bounds. 2. For each $D$, $\mathcal{G}_{D} \subseteq \mathcal{G}_{D+1}$ and as $D$ grows the class of functions $\mathcal{G}_{D}$ contains progressively less smooth functions. 3. If $D^{\prime} < D$, then the approximation rate for the class $\mathcal{G}_D$ achieved by depth $D^{\prime}$ networks is strictly worse than that achieved by depth $D$ networks. This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth $D$ and number of neurons $N$, in contrast to existing representation results which either require $D$ growing quickly with $N$ or assume that the function being represented is highly smooth. In the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed, the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.
CCMay 16, 2020
Reducibility and Statistical-Computational Gaps from Secret LeakageMatthew Brennan, Guy Bresler
Inference problems with conjectured statistical-computational gaps are ubiquitous throughout modern statistics, computer science and statistical physics. While there has been success evidencing these gaps from the failure of restricted classes of algorithms, progress towards a more traditional reduction-based approach to computational complexity in statistical inference has been limited. Existing reductions have largely been limited to inference problems with similar structure -- primarily mapping among problems representable as a sparse submatrix signal plus a noise matrix, which are similar to the common hardness assumption of planted clique. The insight in this work is that a slight generalization of the planted clique conjecture -- secret leakage planted clique -- gives rise to a variety of new average-case reduction techniques, yielding a web of reductions among problems with very different structure. Using variants of the planted clique conjecture for specific forms of secret leakage planted clique, we deduce tight statistical-computational tradeoffs for a diverse range of problems including robust sparse mean estimation, mixtures of sparse linear regressions, robust sparse linear regression, tensor PCA, variants of dense $k$-block stochastic block models, negatively correlated sparse PCA, semirandom planted dense subgraph, detection in hidden partition models and a universality principle for learning sparse mixtures. In particular, a $k$-partite hypergraph variant of the planted clique conjecture is sufficient to establish all of our computational lower bounds. Our techniques also reveal novel connections to combinatorial designs and to random matrix theory. This work gives the first evidence that an expanded set of hardness assumptions, such as for secret leakage planted clique, may be a key first step towards a more complete theory of reductions among statistical problems.
LGFeb 1, 2020
A Corrective View of Neural Networks: Representation, Memorization and LearningGuy Bresler, Dheeraj Nagaraj
We develop a corrective mechanism for neural network approximation: the total available non-linear units are divided into multiple groups and the first group approximates the function under consideration, the second group approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels for arbitrary points under under Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs which can achieve a squared error of at most $ε$ with $O(C(a,d)ε^{-1/(a+1)})$ for $a \in \mathbb{N}\cup\{0\}$ when the function is smooth enough (roughly when it has $Θ(ad)$ bounded derivatives). In certain cases $d$ can be replaced with effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/ε))$ upper bound on the number of neurons required for a two layer network to learn low degree polynomials up to squared error $ε$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/ε))$ neurons, the best learning bounds on this problem require $\mathrm{poly}(1/ε)$ neurons.
CCAug 8, 2019
Average-Case Lower Bounds for Learning Sparse Mixtures, Robust Estimation and Semirandom AdversariesMatthew Brennan, Guy Bresler
This paper develops several average-case reduction techniques to show new hardness results for three central high-dimensional statistics problems, implying a statistical-computational gap induced by robustness, a detection-recovery gap and a universality principle for these gaps. A main feature of our approach is to map to these problems via a common intermediate problem that we introduce, which we call Imbalanced Sparse Gaussian Mixtures. We assume the planted clique conjecture for a version of the planted clique problem where the position of the planted clique is mildly constrained, and from this obtain the following computational lower bounds: (1) a $k$-to-$k^2$ statistical-computational gap for robust sparse mean estimation, providing the first average-case evidence for a conjecture of Li (2017) and Balakrishnan et al. (2017); (2) a tight lower bound for semirandom planted dense subgraph, which shows that a semirandom adversary shifts the detection threshold in planted dense subgraph to the conjectured recovery threshold; and (3) a universality principle for $k$-to-$k^2$ gaps in a broad class of sparse mixture problems that includes many natural formulations such as the spiked covariance model. Our main approach is to introduce several average-case techniques to produce structured and Gaussianized versions of an input graph problem, and then to rotate these high-dimensional Gaussians by matrices carefully constructed from hyperplanes in $\mathbb{F}_r^t$. For our universality result, we introduce a new method to perform an algorithmic change of measure tailored to sparse mixtures. We also provide evidence that the mild promise in our variant of planted clique does not change the complexity of the problem.
CCFeb 20, 2019
Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong HardnessMatthew Brennan, Guy Bresler
In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs. This trend has largely been driven by a line of research aiming to characterize the average-case complexity of sparse PCA through reductions from the planted clique (PC) conjecture - which conjectures that there is no polynomial-time algorithm to detect a planted clique of size $K = o(N^{1/2})$ in $\mathcal{G}(N, \frac{1}{2})$. All previous reductions to sparse PCA either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than its canonical generative model, the spiked covariance model. Also, these lower bounds all quickly degrade with the exponent in the PC conjecture. Specifically, when only given the PC conjecture up to $K = o(N^α)$ where $α< 1/2$, there is no sparsity level $k$ at which these lower bounds remain tight. If $α\le 1/3$ these reductions fail to even show the existence of a statistical-computational tradeoff at any sparsity $k$. We give a reduction from PC that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. We also show the surprising result that weaker forms of the PC conjecture up to clique size $K = o(N^α)$ for any given $α\in (0, 1/2]$ imply tight computational lower bounds for sparse PCA at sparsities $k = o(n^{α/3})$. This shows that even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms would imply that the hardness threshold for PC is subpolynomial. This is the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning.
STFeb 19, 2019
Universality of Computational Lower Bounds for Submatrix DetectionMatthew Brennan, Guy Bresler, Wasim Huleihel
In the general submatrix detection problem, the task is to detect the presence of a small $k \times k$ submatrix with entries sampled from a distribution $\mathcal{P}$ in an $n \times n$ matrix of samples from $\mathcal{Q}$. This formulation includes a number of well-studied problems, such as biclustering when $\mathcal{P}$ and $\mathcal{Q}$ are Gaussians and the planted dense subgraph formulation of community detection when the submatrix is a principal minor and $\mathcal{P}$ and $\mathcal{Q}$ are Bernoulli random variables. These problems all seem to exhibit a universal phenomenon: there is a statistical-computational gap depending on $\mathcal{P}$ and $\mathcal{Q}$ between the minimum $k$ at which this task can be solved and the minimum $k$ at which it can be solved in polynomial time. Our main result is to tightly characterize this computational barrier as a tradeoff between $k$ and the KL divergences between $\mathcal{P}$ and $\mathcal{Q}$ through average-case reductions from the planted clique conjecture. These computational lower bounds hold given mild assumptions on $\mathcal{P}$ and $\mathcal{Q}$ arising naturally from classical binary hypothesis testing. Our results recover and generalize the planted clique lower bounds for Gaussian biclustering in Ma-Wu (2015) and Brennan et al. (2018) and for the sparse and general regimes of planted dense subgraph in Hajek et al. (2015) and Brennan et al. (2018). This yields the first universality principle for computational lower bounds obtained through average-case reductions.
STNov 25, 2018
Sparse PCA from Sparse Linear RegressionGuy Bresler, Sung Min Park, Madalina Persu
Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state of the art guarantees for testing and for support recovery for the single spiked covariance model as obtained by the current best polynomialtime algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.
LGMay 25, 2018
Learning Restricted Boltzmann Machines via Influence MaximizationGuy Bresler, Frederic Koehler, Ankur Moitra et al.
Graphical models are a rich language for describing high-dimensional distributions in terms of their dependence structure. While there are algorithms with provable guarantees for learning undirected graphical models in a variety of settings, there has been much less progress in the important scenario when there are latent variables. Here we study Restricted Boltzmann Machines (or RBMs), which are a popular model with wide-ranging applications in dimensionality reduction, collaborative filtering, topic modeling, feature extraction and deep learning. The main message of our paper is a strong dichotomy in the feasibility of learning RBMs, depending on the nature of the interactions between variables: ferromagnetic models can be learned efficiently, while general models cannot. In particular, we give a simple greedy algorithm based on influence maximization to learn ferromagnetic RBMs with bounded degree. In fact, we learn a description of the distribution on the observed variables as a Markov Random Field. Our analysis is based on tools from mathematical physics that were developed to show the concavity of magnetization. Our algorithm extends straighforwardly to general ferromagnetic Ising models with latent variables. Conversely, we show that even for a contant number of latent variables with constant degree, without ferromagneticity the problem is as hard as sparse parity with noise. This hardness result is based on a sharp and surprising characterization of the representational power of bounded degree RBMs: the distribution on their observed variables can simulate any bounded order MRF. This result is of independent interest since RBMs are the building blocks of deep belief networks.
STFeb 17, 2018
Optimal Single Sample Tests for Structured versus Unstructured Network DataGuy Bresler, Dheeraj Nagaraj
We study the problem of testing, using only a single sample, between mean field distributions (like Curie-Weiss, Erdős-Rényi) and structured Gibbs distributions (like Ising model on sparse graphs and Exponential Random Graphs). Our goal is to test without knowing the parameter values of the underlying models: only the \emph{structure} of dependencies is known. We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test. The test can distinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses. The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures. The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction radically changes the behavior of the functions under consideration, resulting in a much smaller variance than in the independent setting; this makes it hard to directly apply standard methods (i.e., Stein's method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.
MLNov 6, 2017
Regret Bounds and Regimes of Optimality for User-User and Item-Item Collaborative FilteringGuy Bresler, Mina Karzand
We consider an online model for recommendation systems, with each user being recommended an item at each time-step and providing 'like' or 'dislike' feedback. Each user may be recommended a given item at most once. A latent variable model specifies the user preferences: both users and items are clustered into types. All users of a given type have identical preferences for the items, and similarly, items of a given type are either all liked or all disliked by a given user. We assume that the matrix encoding the preferences of each user type for each item type is randomly generated; in this way, the model captures structure in both the item and user spaces, the amount of structure depending on the number of each of the types. The measure of performance of the recommendation system is the expected number of disliked recommendations per user, defined as expected regret. We propose two algorithms inspired by user-user and item-item collaborative filtering (CF), modified to explicitly make exploratory recommendations, and prove performance guarantees in terms of their expected regret. For two regimes of model parameters, with structure only in item space or only in user space, we prove information-theoretic lower bounds on regret that match our upper bounds up to logarithmic factors. Our analysis elucidates system operating regimes in which existing CF algorithms are nearly optimal.
STApr 22, 2016
Learning a Tree-Structured Ising Model in Order to Make PredictionsGuy Bresler, Mina Karzand
We study the problem of learning a tree Ising model from samples such that subsequent predictions made using the model are accurate. The prediction task considered in this paper is that of predicting the values of a subset of variables given values of some other subset of variables. Virtually all previous work on graphical model learning has focused on recovering the true underlying graph. We define a distance ("small set TV" or ssTV) between distributions $P$ and $Q$ by taking the maximum, over all subsets $\mathcal{S}$ of a given size, of the total variation between the marginals of $P$ and $Q$ on $\mathcal{S}$; this distance captures the accuracy of the prediction task of interest. We derive non-asymptotic bounds on the number of samples needed to get a distribution (from the same class) with small ssTV relative to the one generating the samples. One of the main messages of this paper is that far fewer samples are needed than for recovering the underlying tree, which means that accurate predictions are possible using the wrong tree.
LGJul 20, 2015
Regret Guarantees for Item-Item Collaborative FilteringGuy Bresler, Devavrat Shah, Luis F. Voloch
There is much empirical evidence that item-item collaborative filtering works well in practice. Motivated to understand this, we provide a framework to design and analyze various recommendation algorithms. The setup amounts to online binary matrix completion, where at each time a random user requests a recommendation and the algorithm chooses an entry to reveal in the user's row. The goal is to minimize regret, or equivalently to maximize the number of +1 entries revealed at any time. We analyze an item-item collaborative filtering algorithm that can achieve fundamentally better performance compared to user-user collaborative filtering. The algorithm achieves good "cold-start" performance (appropriately defined) by quickly making good recommendations to new users about whom there is little information.
MLDec 3, 2014
Structure learning of antiferromagnetic Ising modelsGuy Bresler, David Gamarnik, Devavrat Shah
In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. We first observe that the notoriously difficult problem of learning parities with noise can be captured as a special case of learning graphical models. This leads to an unconditional computational lower bound of $Ω(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of so-called statistical algorithms recently introduced by Feldman et al (2013). The lower bound suggests that the $O(p^d)$ runtime required to exhaustively search over neighborhoods cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., many recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari (2009) showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the opposite behavior: very strong interaction allows efficient learning in time $O(p^2)$. We provide an algorithm whose performance interpolates between $O(p^2)$ and $O(p^{d+2})$ depending on the strength of the repulsion.
LGNov 22, 2014
Efficiently learning Ising models on arbitrary graphsGuy Bresler
We consider the problem of reconstructing the graph underlying an Ising model from i.i.d. samples. Over the last fifteen years this problem has been of significant interest in the statistics, machine learning, and statistical physics communities, and much of the effort has been directed towards finding algorithms with low computational cost for various restricted classes of models. Nevertheless, for learning Ising models on general graphs with $p$ nodes of degree at most $d$, it is not known whether or not it is possible to improve upon the $p^{d}$ computation needed to exhaustively search over all possible neighborhoods for each node. In this paper we show that a simple greedy procedure allows to learn the structure of an Ising model on an arbitrary bounded-degree graph in time on the order of $p^2$. We make no assumptions on the parameters except what is necessary for identifiability of the model, and in particular the results hold at low-temperatures as well as for highly non-uniform models. The proof rests on a new structural property of Ising models: we show that for any node there exists at least one neighbor with which it has a high mutual information. This structural property may be of independent interest.
LGOct 31, 2014
A Latent Source Model for Online Collaborative FilteringGuy Bresler, George H. Chen, Devavrat Shah
Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the "online" setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly $\log(km)$ initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing $k$. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).
LGOct 28, 2014
Learning graphical models from the Glauber dynamicsGuy Bresler, David Gamarnik, Devavrat Shah
In this paper we consider the problem of learning undirected graphical models from data generated according to the Glauber dynamics. The Glauber dynamics is a Markov chain that sequentially updates individual nodes (variables) in a graphical model and it is frequently used to sample from the stationary distribution (to which it converges given sufficient time). Additionally, the Glauber dynamics is a natural dynamical model in a variety of settings. This work deviates from the standard formulation of graphical model learning in the literature, where one assumes access to i.i.d. samples from the distribution. Much of the research on graphical model learning has been directed towards finding algorithms with low computational cost. As the main result of this work, we establish that the problem of reconstructing binary pairwise graphical models is computationally tractable when we observe the Glauber dynamics. Specifically, we show that a binary pairwise graphical model on $p$ nodes with maximum degree $d$ can be learned in time $f(d)p^2\log p$, for a function $f(d)$, using nearly the information-theoretic minimum number of samples.
CCSep 12, 2014
Hardness of parameter estimation in graphical modelsGuy Bresler, David Gamarnik, Devavrat Shah
We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).