Santosh Vempala

h-index53

42papers

1,941citations

Novelty63%

AI Score48

Ranked #30,048 of 194,257 authors (top 15%)#88 in DS (top 18%)

42 Papers

20.2CLNov 24, 2023

Calibrated Language Models Must Hallucinate

Adam Tauman Kalai, Santosh S. Vempala

Recent language models generate false but plausible-sounding text with surprising frequency. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows that there is an inherent statistical lower-bound on the rate that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.

15.6LGApr 22, 2022

Convergence of the Riemannian Langevin Algorithm

Khashayar Gatmiry, Santosh S. Vempala

We study the Riemannian Langevin Algorithm for the problem of sampling from a distribution with density $ν$ with respect to the natural measure on a manifold with metric $g$. We assume that the target density satisfies a log-Sobolev inequality with respect to the metric and prove that the manifold generalization of the Unadjusted Langevin Algorithm converges rapidly to $ν$ for Hessian manifolds. This allows us to reduce the problem of sampling non-smooth (constrained) densities in ${\bf R}^n$ to sampling smooth densities over appropriate manifolds, while needing access only to the gradient of the log-density, and this, in turn, to sampling from the natural Brownian motion on the manifold. Our main analytic tools are (1) an extension of self-concordance to manifolds, and (2) a stochastic approach to bounding smoothness on manifolds. A special case of our approach is sampling isoperimetric densities restricted to polytopes by using the metric defined by the logarithmic barrier.

8.6DSOct 13, 2022

Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators

Yunbum Kook, Yin Tat Lee, Ruoqi Shen et al.

We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-α^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert α\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $ε$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.

10.3DSJun 22, 2022

Constant-Factor Approximation Algorithms for Socially Fair $k$-Clustering

Mehrdad Ghadiri, Mohit Singh, Santosh S. Vempala

We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, whose special cases include the socially fair $k$-median ($p=1$) and socially fair $k$-means ($p=2$) problems. We present (1) a polynomial-time $(5+2\sqrt{6})^p$-approximation with at most $k+m$ centers (2) a $(5+2\sqrt{6}+ε)^p$-approximation with $k$ centers in time $n^{2^{O(p)}\cdot m^2}$, and (3) a $(15+6\sqrt{6})^p$ approximation with $k$ centers in time $k^{m}\cdot\text{poly}(n)$. The first result is obtained via a refinement of the iterative rounding method using a sequence of linear programs. The latter two results are obtained by converting a solution with up to $k+m$ centers to one with $k$ centers using sparsification methods for (2) and via an exhaustive search for (3). We also compare the performance of our algorithms with existing bicriteria algorithms as well as exactly $k$ center approximation algorithms on benchmark datasets, and find that our algorithms also outperform existing methods in practice.

9.6NEJun 6, 2023

Computation with Sequences in a Model of the Brain

Max Dabagia, Christos H. Papadimitriou, Santosh S. Vempala

Even as machine learning exceeds human-level performance on many applications, the generality, robustness, and rapidity of the brain's learning capabilities remain unmatched. How cognition arises from neural activity is a central open question in neuroscience, inextricable from the study of intelligence itself. A simple formal model of neural activity was proposed in Papadimitriou [2020] and has been subsequently shown, through both mathematical proofs and simulations, to be capable of implementing certain simple cognitive operations via the creation and manipulation of assemblies of neurons. However, many intelligent behaviors rely on the ability to recognize, store, and manipulate temporal sequences of stimuli (planning, language, navigation, to list a few). Here we show that, in the same model, time can be captured naturally as precedence through synaptic weights and plasticity, and, as a result, a range of computations on sequences of assemblies can be carried out. In particular, repeated presentation of a sequence of stimuli leads to the memorization of the sequence through corresponding neural assemblies: upon future presentation of any stimulus in the sequence, the corresponding assembly and its subsequent ones will be activated, one after the other, until the end of the sequence. Finally, we show that any finite state machine can be learned in a similar way, through the presentation of appropriate patterns of sequences. Through an extension of this mechanism, the model can be shown to be capable of universal computation. We support our analysis with a number of experiments to probe the limits of learning in this model in key ways. Taken together, these results provide a concrete hypothesis for the basis of the brain's remarkable abilities to compute and learn, with sequences playing a vital role.

5.1DSFeb 23, 2023

Beyond Moments: Robustly Learning Affine Transformations with Asymptotically Optimal Error

He Jia, Pravesh K . Kothari, Santosh S. Vempala

We present a polynomial-time algorithm for robustly learning an unknown affine transformation of the standard hypercube from samples, an important and well-studied setting for independent component analysis (ICA). Specifically, given an $ε$-corrupted sample from a distribution $D$ obtained by applying an unknown affine transformation $x \rightarrow Ax+s$ to the uniform distribution on a $d$-dimensional hypercube $[-1,1]^d$, our algorithm constructs $\hat{A}, \hat{s}$ such that the total variation distance of the distribution $\hat{D}$ from $D$ is $O(ε)$ using poly$(d)$ time and samples. Total variation distance is the information-theoretically strongest possible notion of distance in our setting and our recovery guarantees in this distance are optimal up to the absolute constant factor multiplying $ε$. In particular, if the columns of $A$ are normalized to be unit length, our total variation distance guarantee implies a bound on the sum of the $\ell_2$ distances between the column vectors of $A$ and $A'$, $\sum_{i =1}^d \|a_i-\hat{a}_i\|_2 = O(ε)$. In contrast, the strongest known prior results only yield a $ε^{O(1)}$ (relative) bound on the distance between individual $a_i$'s and their estimates and translate into an $O(dε)$ bound on the total variation distance. Our key innovation is a new approach to ICA (even to outlier-free ICA) that circumvents the difficulties in the classical method of moments and instead relies on a new geometric certificate of correctness of an affine transformation. Our algorithm is based on a new method that iteratively improves an estimate of the unknown affine transformation whenever the requirements of the certificate are not met.

3.8LGNov 2, 2023

Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time

Xinyuan Cao, Santosh S. Vempala

We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $ε$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/ε$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.

4.3DSJul 24, 2023

Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling

Yunbum Kook, Santosh S. Vempala

The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.

12.1PRMar 24

The Localization Method for High-Dimensional Inequalities

Yunbum Kook, Santosh S. Vempala

We survey the localization method for proving inequalities in high dimension, pioneered by LovÃ¡sz and Simonovits (1993), and its stochastic extension developed by Eldan (2012). The method has found applications in a surprising wide variety of settings, ranging from its original motivation in isoperimetric inequalities to optimization, concentration of measure, and bounding the mixing rate of Markov chains. At heart, the method converts a given instance of an inequality (for a set or distribution in high dimension) into a highly structured instance, often just one-dimensional.

10.8DSMay 2, 2024

In-and-Out: Algorithmic Diffusion for Sampling Convex Bodies

Yunbum Kook, Santosh S. Vempala, Matthew S. Zhang

We present a new random walk for uniformly sampling high-dimensional convex bodies. It achieves state-of-the-art runtime complexity with stronger guarantees on the output than previously known, namely in Rényi divergence (which implies TV, $\mathcal{W}_2$, KL, $χ^2$). The proof departs from known approaches for polytime algorithms for the problem -- we utilize a stochastic diffusion perspective to show contraction to the target distribution with the rate of convergence determined by functional isoperimetric constants of the target distribution.

9.7DSNov 20, 2024

Sampling and Integration of Logconcave Functions by Algorithmic Diffusion

Yunbum Kook, Santosh S. Vempala

We study the complexity of sampling, rounding, and integrating arbitrary logconcave functions. Our new approach provides the first complexity improvements in nearly two decades for general logconcave functions for all three problems, and matches the best-known complexities for the special case of uniform distributions on convex bodies. For the sampling problem, our output guarantees are significantly stronger than previously known, and lead to a streamlined analysis of statistical estimation based on dependent random samples.

7.3DSMay 3, 2025

Faster logconcave sampling from a cold start in high dimension

Yunbum Kook, Santosh S. Vempala

We present a faster algorithm to generate a warm start for sampling an arbitrary logconcave density specified by an evaluation oracle, leading to the first sub-cubic sampling algorithms for inputs in (near-)isotropic position. A long line of prior work incurred a warm-start penalty of at least linear in the dimension, hitting a cubic barrier, even for the special case of uniform sampling from convex bodies. Our improvement relies on two key ingredients of independent interest. (1) We show how to sample given a warm start in weaker notions of distance, in particular $q$-Rényi divergence for $q=\widetilde{\mathcal{O}}(1)$, whereas previous analyses required stringent $\infty$-Rényi divergence (with the exception of Hit-and-Run, whose known mixing time is higher). This marks the first improvement in the required warmness since Lovász and Simonovits (1991). (2) We refine and generalize the log-Sobolev inequality of Lee and Vempala (2018), originally established for isotropic logconcave distributions in terms of the diameter of the support, to logconcave distributions in terms of a geometric average of the support diameter and the largest eigenvalue of the covariance matrix.

4.2AIJun 20, 2024

Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AI's Understanding of Algorithms

Mirabel Reid, Santosh S. Vempala

As Large Language Models (LLMs) perform (and sometimes excel at) more and more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research in philosophy, psychology, and education. We initiate this, specifically focusing on understanding algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful to keep track of AI's progress in such cognitive domains.

7.3NENov 17, 2021

How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization

Shivam Garg, Santosh S. Vempala

The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.

12.5LGOct 27, 2021

Provable Lifelong Learning of Representations

Xinyuan Cao, Weiyang Liu, Santosh S. Vempala

In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension $d$, $m$ tasks with $k$ features up to error $ε$ is $\tilde{O}(dk^{1.5}/ε+km/ε)$. We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.

11.7NEOct 7, 2021Code

Assemblies of neurons learn to classify well-separated distributions

Max Dabagia, Christos H. Papadimitriou, Santosh S. Vempala

An assembly is a large population of neurons whose synchronous firing is hypothesized to represent a memory, concept, word, and other cognitive categories. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the Assembly Calculus (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate learning has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated -- for example, when they are clusters of similar assemblies. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision -- all key attributes of learning in a model of the brain.

17.3DSSep 24, 2021

The Mirror Langevin Algorithm Converges with Vanishing Bias

Ruilin Li, Molei Tao, Santosh S. Vempala et al.

The technique of modifying the geometry of a problem from Euclidean to Hessian metric has proved to be quite effective in optimization, and has been the subject of study for sampling. The Mirror Langevin Diffusion (MLD) is a sampling analogue of mirror flow in continuous time, and it has nice convergence properties under log-Sobolev or Poincare inequalities relative to the Hessian metric, as shown by Chewi et al. (2020). In discrete time, a simple discretization of MLD is the Mirror Langevin Algorithm (MLA) studied by Zhang et al. (2020), who showed a biased convergence bound with a non-vanishing bias term (does not go to zero as step size goes to zero). This raised the question of whether we need a better analysis or a better discretization to achieve a vanishing bias. Here we study the basic Mirror Langevin Algorithm and show it indeed has a vanishing bias. We apply mean-square analysis based on Li et al. (2019) and Li et al. (2021) to show the mixing time bound for MLA under the modified self-concordance condition introduced by Zhang et al. (2020).

20.8DSDec 3, 2020

Robustly Learning Mixtures of $k$ Arbitrary Gaussians

Ainesh Bakshi, Ilias Diakonikolas, He Jia et al.

We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\mathbb{R}^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mixture of TV-distance separated Gaussians, and (c) a uniform mixture of two Gaussians. Our main tools are an efficient \emph{partial clustering} algorithm that relies on the sum-of-squares method, and a novel \emph{tensor decomposition} algorithm that allows errors in both Frobenius norm and low-rank terms.

20.2LGJun 17, 2020Code

Socially Fair k-Means Clustering

Mehrdad Ghadiri, Samira Samadi, Santosh Vempala

We show that the popular k-means clustering algorithm (Lloyd's heuristic), used for a variety of scientific data, can result in outcomes that are unfavorable to subgroups of data (e.g., demographic groups). Such biased clusterings can have deleterious implications for human-centric applications such as resource allocation. We present a fair k-means objective and algorithm to choose cluster centers that provide equitable costs for different groups. The algorithm, Fair-Lloyd, is a modification of Lloyd's heuristic for k-means, inheriting its simplicity, efficiency, and stability. In comparison with standard Lloyd's, we find that on benchmark datasets, Fair-Lloyd exhibits unbiased performance by ensuring that all groups have equal costs in the output k-clustering, while incurring a negligible increase in running time, thus making it a viable fair option wherever k-means is currently used.

6.6DSNov 26, 2019

Robustly Clustering a Mixture of Gaussians

He Jia, Santosh Vempala

We give an efficient algorithm for robustly clustering of a mixture of two arbitrary Gaussians, a central open problem in the theory of computationally efficient robust estimation, assuming only that the the means of the component Gaussians are well-separated or their covariances are well-separated. Our algorithm and analysis extend naturally to robustly clustering mixtures of well-separated strongly logconcave distributions. The mean separation required is close to the smallest possible to guarantee that most of the measure of each component can be separated by some hyperplane (for covariances, it is the same condition in the second degree polynomial kernel). We also show that for Gaussian mixtures, separation in total variation distance suffices to achieve robust clustering. Our main tools are a new identifiability criterion based on isotropic position and the Fisher discriminant, and a corresponding Sum-of-Squares convex programming relaxation, of fixed degree.

11.7DSJun 13, 2019

The Communication Complexity of Optimization

Santosh S. Vempala, Ruosong Wang, David P. Woodruff

We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with $s$ servers $P_1, \ldots, P_s$, the $i$-th of which holds a subset $A^{(i)} x = b^{(i)}$ of $n_i$ constraints of a linear system in $d$ variables, and the coordinator would like to output $x \in \mathbb{R}^d$ for which $A^{(i)} x = b^{(i)}$ for $i = 1, \ldots, s$. We assume each coefficient of each constraint is specified using $L$ bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is $\tildeΘ(d^2L + sd)$ and $\tildeΘ(sd^2L)$, respectively. We obtain similar results for the blackboard model. When there is no solution to the linear system, a natural alternative is to find the solution minimizing the $\ell_p$ loss. While this problem has been studied, we give improved upper or lower bounds for every value of $p \ge 1$. One takeaway message is that sampling and sketching techniques, which are commonly used in earlier work on distributed optimization, are neither optimal in the dependence on $d$ nor on the dependence on the approximation $ε$, thus motivating new techniques from optimization to solve these problems. Towards this end, we consider the communication complexity of optimization tasks which generalize linear systems. For linear programming, we first resolve the communication complexity when $d$ is constant, showing it is $\tildeΘ(sL)$ in the point-to-point model. For general $d$ and in the point-to-point model, we show an $\tilde{O}(sd^3 L)$ upper bound and an $\tildeΩ(d^2 L + sd)$ lower bound. We also show if one perturbs the coefficients randomly by numbers as small as $2^{-Θ(L)}$, then the upper bound is $\tilde{O}(sd^2 L) + \textrm{poly}(dL)$.

2.7CRMay 31, 2019

Human-Usable Password Schemas: Beyond Information-Theoretic Security

Elan Rosenfeld, Santosh Vempala, Manuel Blum

Password users frequently employ passwords that are too simple, or they just reuse passwords for multiple websites. A common complaint is that utilizing secure passwords is too difficult. One possible solution to this problem is to use a password schema. Password schemas are deterministic functions which map challenges (typically the website name) to responses (passwords). Previous work has been done on developing and analyzing publishable schemas, but these analyses have been information-theoretic, not complexity-theoretic; they consider an adversary with infinite computing power. We perform an analysis with respect to adversaries having currently achievable computing capabilities, assessing the realistic practical security of such schemas. We prove for several specific schemas that a computer is no worse off than an infinite adversary and that it can successfully extract all information from leaked challenges and their respective responses, known as challenge-response pairs. We also show that any schema that hopes to be secure against adversaries with bounded computation should obscure information in a very specific way, by introducing many possible constraints with each challenge-response pair. These surprising results put the analyses of password schemas on a more solid and practical footing.

18.1DSMay 7, 2019

Optimal Convergence Rate of Hamiltonian Monte Carlo for Strongly Logconcave Distributions

Zongchen Chen, Santosh S. Vempala

We study Hamiltonian Monte Carlo (HMC) for sampling from a strongly logconcave density proportional to $e^{-f}$ where $f:\mathbb{R}^d \to \mathbb{R}$ is $μ$-strongly convex and $L$-smooth (the condition number is $κ= L/μ$). We show that the relaxation time (inverse of the spectral gap) of ideal HMC is $O(κ)$, improving on the previous best bound of $O(κ^{1.5})$; we complement this with an example where the relaxation time is $Ω(κ)$. When implemented using a nearly optimal ODE solver, HMC returns an $\varepsilon$-approximate point in $2$-Wasserstein distance using $\widetilde{O}((κd)^{0.5} \varepsilon^{-1})$ gradient evaluations per step and $\widetilde{O}((κd)^{1.5}\varepsilon^{-1})$ total time.

33.4DSMar 20, 2019

Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

Santosh S. Vempala, Andre Wibisono

We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $ν= e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $ν$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in Rényi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincaré inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of $f$, without requiring isoperimetry.

16.1DMFeb 28, 2019Code

Multi-Criteria Dimensionality Reduction with Applications to Fairness

Uthaipon Tantipongpipat, Samira Samadi, Mohit Singh et al.

Dimensionality reduction is a classical technique widely used for data analysis. One foundational instantiation is Principal Component Analysis (PCA), which minimizes the average reconstruction error. In this paper, we introduce the "multi-criteria dimensionality reduction" problem where we are given multiple objectives that need to be optimized simultaneously. As an application, our model captures several fairness criteria for dimensionality reduction such as our novel Fair-PCA problem and the Nash Social Welfare (NSW) problem. In Fair-PCA, the input data is divided into $k$ groups, and the goal is to find a single $d$-dimensional representation for all groups for which the minimum variance of any one group is maximized. In NSW, the goal is to maximize the product of the individual variances of the groups achieved by the common low-dimensional space. Our main result is an exact polynomial-time algorithm for the two-criterion dimensionality reduction problem when the two criteria are increasing concave functions. As an application of this result, we obtain a polynomial time algorithm for Fair-PCA for $k=2$ groups and a polynomial time algorithm for NSW objective for $k=2$ groups. We also give approximation algorithms for $k>2$. Our technical contribution in the above results is to prove new low-rank properties of extreme point solutions to semi-definite programs. We conclude with experiments indicating the effectiveness of algorithms based on extreme point solutions of semi-definite programs on several real-world data sets.

16.1DSDec 15, 2018

Algorithmic Theory of ODEs and Sampling from Well-conditioned Logconcave Densities

Yin Tat Lee, Zhao Song, Santosh S. Vempala

Sampling logconcave functions arising in statistics and machine learning has been a subject of intensive study. Recent developments include analyses for Langevin dynamics and Hamiltonian Monte Carlo (HMC). While both approaches have dimension-independent bounds for the underlying $\mathit{continuous}$ processes under sufficiently strong smoothness conditions, the resulting discrete algorithms have complexity and number of function evaluations growing with the dimension. Motivated by this problem, in this paper, we give a general algorithm for solving multivariate ordinary differential equations whose solution is close to the span of a known basis of functions (e.g., polynomials or piecewise polynomials). The resulting algorithm has polylogarithmic depth and essentially tight runtime - it is nearly linear in the size of the representation of the solution. We apply this to the sampling problem to obtain a nearly linear implementation of HMC for a broad class of smooth, strongly logconcave densities, with the number of iterations (parallel depth) and gradient evaluations being $\mathit{polylogarithmic}$ in the dimension (rather than polynomial as in previous work). This class includes the widely-used loss function for logistic regression with incoherent weight matrices and has been subject of much study recently. We also give a faster algorithm with $ \mathit{polylogarithmic~depth}$ for the more general and standard class of strongly convex functions with Lipschitz gradient. These results are based on (1) an improved contraction bound for the exact HMC process and (2) logarithmic bounds on the degree of polynomials that approximate solutions of the differential equations arising in implementing HMC.

22.7LGOct 31, 2018Code

The Price of Fair PCA: One Extra Dimension

Samira Samadi, Uthaipon Tantipongpipat, Jamie Morgenstern et al.

We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low dimensional representation of the data which is nearly-optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low dimensional representation of the data.

4.3DSOct 28, 2018

Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons

Nima Anari, Constantinos Daskalakis, Wolfgang Maass et al.

We analyze linear independence of rank one tensors produced by tensor powers of randomly perturbed vectors. This enables efficient decomposition of sums of high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider range of perturbation models, including discrete ones. We give an application to recovering assemblies of neurons. Assemblies are large sets of neurons representing specific memories or concepts. The size of the intersection of two assemblies has been shown in experiments to represent the extent to which these memories co-occur or these concepts are related; the phenomenon is called association of assemblies. This suggests that an animal's memory is a complex web of associations, and poses the problem of recovering this representation from cognitive data. Motivated by this problem, we study the following more general question: Can we reconstruct the Venn diagram of a family of sets, given the sizes of their $\ell$-wise intersections? We show that as long as the family of sets is randomly perturbed, it is enough for the number of measurements to be polynomially larger than the number of nonempty regions of the Venn diagram to fully reconstruct the diagram.

20.4LGMay 7, 2018

Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

Santosh Vempala, John Wilmes

We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and number of iterations needed are both bounded by $n^{O(k)}\log(1/ε)$. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding that gradient descent discovers lower frequency Fourier components before higher frequency components. We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality $n$ must use $n^{Ω(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $Θ(k \log n)$. Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss.

5.7HCDec 11, 2017

Usability of Humanly Computable Passwords

Samira Samadi, Santosh Vempala, Adam Tauman Kalai

Reusing passwords across multiple websites is a common practice that compromises security. Recently, Blum and Vempala have proposed password strategies to help people calculate, in their heads, passwords for different sites without dependence on third-party tools or external devices. Thus far, the security and efficiency of these "mental algorithms" has been analyzed only theoretically. But are such methods usable? We present the first usability study of humanly computable password strategies, involving a learning phase (to learn a password strategy), then a rehearsal phase (to login to a few websites), and multiple follow-up tests. In our user study, with training, participants were able to calculate a deterministic eight-character password for an arbitrary new website in under 20 seconds.

15.4LGJul 14, 2017

On the Complexity of Learning Neural Networks

Le Song, Santosh Vempala, John Wilmes et al.

The stunning empirical successes of neural networks currently lack rigorous theoretical explanation. What form would such an explanation take, in the face of existing complexity-theoretic lower bounds? A first step might be to show that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently. We demonstrate here a comprehensive lower bound ruling out this possibility: for a wide class of activation functions (including all currently used), and inputs drawn from any logconcave distribution, there is a family of one-hidden-layer functions whose output is a sum gate, that are hard to learn in a precise sense: any statistical query algorithm (which includes all known variants of stochastic gradient descent with any loss function) needs an exponential number of queries even using tolerance inversely proportional to the input dimensionality. Moreover, this hard family of functions is realizable with a small (sublinear in dimension) number of activation units in the single hidden layer. The lower bound is also robust to small perturbations of the true weights. Systematic experiments illustrate a phase transition in the training error as predicted by the analysis.

12.2HCJul 5, 2017

The Complexity of Human Computation: A Concrete Model with an Application to Passwords

Manuel Blum, Santosh Vempala

What can humans compute in their heads? We are thinking of a variety of Crypto Protocols, games like Sudoku, Crossword Puzzles, Speed Chess, and so on. The intent of this paper is to apply the ideas and methods of theoretical computer science to better understand what humans can compute in their heads. For example, can a person compute a function in their head so that an eavesdropper with a powerful computer --- who sees the responses to random input --- still cannot infer responses to new inputs? To address such questions, we propose a rigorous model of human computation and associated measures of complexity. We apply the model and measures first and foremost to the problem of (1) humanly computable password generation, and then consider related problems of (2) humanly computable "one-way functions" and (3) humanly computable "pseudorandom generators". The theory of Human Computability developed here plays by different rules than standard computability, and this takes some getting used to. For reasons to be made clear, the polynomial versus exponential time divide of modern computability theory is irrelevant to human computation. In human computability, the step-counts for both humans and computers must be more concrete. Specifically, we restrict the adversary to at most 10^24 (Avogadro number of) steps. An alternate view of this work is that it deals with the analysis of algorithms and counting steps for the case that inputs are small as opposed to the usual case of inputs large-in-the-limit.

1.2DSJun 1, 2017

Geodesic Walks in Polytopes

Yin Tat Lee, Santosh S. Vempala

We introduce the geodesic walk for sampling Riemannian manifolds and apply it to the problem of generating uniform random points from polytopes in R^n specified by m inequalities. The walk is a discrete-time simulation of a stochastic differential equation (SDE) on the Riemannian manifold equipped with the metric induced by the Hessian of a convex function; each step is the solution of an ordinary differential equation (ODE). The resulting sampling algorithm for polytopes mixes in O*(mn^{3/4}) steps. This is the first walk that breaks the quadratic barrier for mixing in high dimension, improving on the previous best bound of O*(mn) by Kannan and Narayanan for the Dikin walk. We also show that each step of the geodesic walk (solving an ODE) can be implemented efficiently, thus improving the time complexity for sampling polytopes. Our analysis of the geodesic walk for general Hessian manifolds does not assume positive curvature and might be of independent interest.

3.5LGAug 12, 2016

Chi-squared Amplification: Identifying Hidden Hubs

Ravi Kannan, Santosh Vempala

We consider the following general hidden hubs model: an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs): entries in rows outside $S$ are generated from the probability distribution $p_0 \sim N(0,σ_0^2)$; for each row in $S$, some $k$ of its entries are generated from $p_1 \sim N(0,σ_1^2)$, $σ_1>σ_0$, and the rest of the entries from $p_0$. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a $k \times k$ submatrix. There are two well-known barriers: if $k\geq c\sqrt{n\ln n}$, just the row sums are sufficient to find $S$ in the general model. For the submatrix problem, this can be improved by a $\sqrt{\ln n}$ factor to $k \ge c\sqrt{n}$ by spectral methods or combinatorial methods. In the variant with $p_0=\pm 1$ (with probability $1/2$ each) and $p_1\equiv 1$, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^{0.5-δ}$ for some $δ>0$, when $σ_1^2>2σ_0^2$. The algorithm extends to the setting where planted entries might have different variances each at least as large as $σ_1^2$. We also show a nearly matching lower bound: for $σ_1^2 \le 2σ_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,σ_0^2)$ and a matrix with $k=n^{0.5-δ}$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $σ_1^2=2σ_0^2$, we show that the general hidden hubs problem can be solved for $k\geq c\sqrt n(\ln n)^{1/4}$, improving on the naive row sum-based method.

35.7DSApr 24, 2016Code

Agnostic Estimation of Mean and Covariance

Kevin A. Lai, Anup B. Rao, Santosh Vempala

We consider the problem of estimating the mean and covariance of a distribution from iid samples in $\mathbb{R}^n$, in the presence of an $η$ fraction of malicious noise; this is in contrast to much recent work where the noise itself is assumed to be from a distribution of known type. The agnostic problem includes many interesting special cases, e.g., learning the parameters of a single Gaussian (or finding the best-fit Gaussian) when $η$ fraction of data is adversarially corrupted, agnostically learning a mixture of Gaussians, agnostic ICA, etc. We present polynomial-time algorithms to estimate the mean and covariance with error guarantees in terms of information-theoretic lower bounds. As a corollary, we also obtain an agnostic algorithm for Singular Value Decomposition.

2.9NEFeb 26, 2016

Cortical Computation via Iterative Constructions

Christos Papadimitrou, Samantha Petti, Santosh Vempala

We study Boolean functions of an arbitrary number of input variables that can be realized by simple iterative constructions based on constant-size primitives. This restricted type of construction needs little global coordination or control and thus is a candidate for neurally feasible computation. Valiant's construction of a majority function can be realized in this manner and, as we show, can be generalized to any uniform threshold function. We study the rate of convergence, finding that while linear convergence to the correct function can be achieved for any threshold using a fixed set of primitives, for quadratic convergence, the size of the primitives must grow as the threshold approaches 0 or 1. We also study finite realizations of this process and the learnability of the functions realized. We show that the constructions realized are accurate outside a small interval near the target threshold, where the size of the construction grows as the inverse square of the interval width. This phenomenon, that errors are higher closer to thresholds (and thresholds closer to the boundary are harder to represent), is a well-known cognitive finding.

17.3LGDec 30, 2015

Statistical Query Algorithms for Mean Vector Estimation and Stochastic Convex Optimization

Vitaly Feldman, Cristobal Guzman, Santosh Vempala

Stochastic convex optimization, where the objective is the expectation of a random convex function, is an important and widely used method with numerous applications in machine learning, statistics, operations research and other areas. We study the complexity of stochastic convex optimization given only statistical query (SQ) access to the objective function. We show that well-known and popular first-order iterative methods can be implemented using only statistical queries. For many cases of interest we derive nearly matching upper and lower bounds on the estimation (sample) complexity including linear optimization in the most general setting. We then present several consequences for machine learning, differential privacy and proving concrete lower bounds on the power of convex optimization based methods. The key ingredient of our work is SQ algorithms and lower bounds for estimating the mean vector of a distribution over vectors supported on a convex body in $\mathbb{R}^d$. This natural problem has not been previously studied and we show that our solutions can be used to get substantially improved SQ versions of Perceptron and other online algorithms for learning halfspaces.

4.8NEDec 26, 2014

Unsupervised Learning through Prediction in a Model of Cortex

Christos H. Papadimitriou, Santosh S. Vempala

We propose a primitive called PJOIN, for "predictive join," which combines and extends the operations JOIN and LINK, which Valiant proposed as the basis of a computational theory of cortex. We show that PJOIN can be implemented in Valiant's model. We also show that, using PJOIN, certain reasonably complex learning and pattern matching tasks can be performed, in a way that involves phenomena which have been observed in cognition and the brain, namely memory-based prediction and downward traffic in the cortical hierarchy.

5.1DSDec 9, 2014

Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity

Santosh S. Vempala, Ying Xiao

We present a simple, general technique for reducing the sample complexity of matrix and tensor decomposition algorithms applied to distributions. We use the technique to give a polynomial-time algorithm for standard ICA with sample complexity nearly linear in the dimension, thereby improving substantially on previous bounds. The analysis is based on properties of random polynomials, namely the spacings of an ensemble of polynomials. Our technique also applies to other applications of tensor decompositions, including spherical Gaussian mixture models.

13.6LGNov 6, 2014

Efficient Representations for Life-Long Learning and Autoencoding

Maria-Florina Balcan, Avrim Blum, Santosh Vempala

It has been a long-standing goal in machine learning, as well as in AI more generally, to develop life-long learning systems that learn many different tasks over time, and reuse insights from tasks learned, "learning to learn" as they do so. In this work we pose and provide efficient algorithms for several natural theoretical formulations of this goal. Specifically, we consider the problem of learning many different target functions over time, that share certain commonalities that are initially unknown to the learning algorithm. Our aim is to learn new internal representations as the algorithm learns new target functions, that capture this commonality and allow subsequent learning tasks to be solved more efficiently and from less data. We develop efficient algorithms for two very different kinds of commonalities that target functions might share: one based on learning common low-dimensional and unions of low-dimensional subspaces and one based on learning nonlinear Boolean combinations of features. Our algorithms for learning Boolean feature combinations additionally have a dual interpretation, and can be viewed as giving an efficient procedure for constructing near-optimal sparse Boolean autoencoders under a natural "anchor-set" assumption.

12.1CRMar 31, 2014

Towards Human Computable Passwords

Jeremiah Blocki, Manuel Blum, Anupam Datta et al.

An interesting challenge for the cryptography community is to design authentication protocols that are so simple that a human can execute them without relying on a fully trusted computer. We propose several candidate authentication protocols for a setting in which the human user can only receive assistance from a semi-trusted computer --- a computer that stores information and performs computations correctly but does not provide confidentiality. Our schemes use a semi-trusted computer to store and display public challenges $C_i\in[n]^k$. The human user memorizes a random secret mapping $σ:[n]\rightarrow\mathbb{Z}_d$ and authenticates by computing responses $f(σ(C_i))$ to a sequence of public challenges where $f:\mathbb{Z}_d^k\rightarrow\mathbb{Z}_d$ is a function that is easy for the human to evaluate. We prove that any statistical adversary needs to sample $m=\tildeΩ(n^{s(f)})$ challenge-response pairs to recover $σ$, for a security parameter $s(f)$ that depends on two key properties of $f$. To obtain our results, we apply the general hypercontractivity theorem to lower bound the statistical dimension of the distribution over challenge-response pairs induced by $f$ and $σ$. Our lower bounds apply to arbitrary functions $f $ (not just to functions that are easy for a human to evaluate), and generalize recent results of Feldman et al. As an application, we propose a family of human computable password functions $f_{k_1,k_2}$ in which the user needs to perform $2k_1+2k_2+1$ primitive operations (e.g., adding two digits or remembering $σ(i)$), and we show that $s(f) = \min\{k_1+1, (k_2+1)/2\}$. For these schemes, we prove that forging passwords is equivalent to recovering the secret mapping. Thus, our human computable password schemes can maintain strong security guarantees even after an adversary has observed the user login to many different accounts.

19.1LGJun 25, 2013

Fourier PCA and Robust Tensor Decomposition

Navin Goyal, Santosh Vempala, Ying Xiao

Fourier PCA is Principal Component Analysis of a matrix obtained from higher order derivatives of the logarithm of the Fourier transform of a distribution.We make this method algorithmic by developing a tensor decomposition method for a pair of tensors sharing the same vectors in rank-$1$ decompositions. Our main application is the first provably polynomial-time algorithm for underdetermined ICA, i.e., learning an $n \times m$ matrix $A$ from observations $y=Ax$ where $x$ is drawn from an unknown product distribution with arbitrary non-Gaussian components. The number of component distributions $m$ can be arbitrarily higher than the dimension $n$ and the columns of $A$ only need to satisfy a natural and efficiently verifiable nondegeneracy condition. As a second application, we give an alternative algorithm for learning mixtures of spherical Gaussians with linearly independent means. These results also hold in the presence of Gaussian noise.