Holden Lee

LG
h-index15
29papers
1,452citations
Novelty61%
AI Score53

29 Papers

LGNov 3, 2022
Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions

Hongrui Chen, Holden Lee, Jianfeng Lu

We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small $L^2$ error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in $ε$-accuracy can be done in $\tilde O\left(\frac{d \log (1/δ)}ε\right)$ steps: 1) the variance-$δ$ Gaussian perturbation of any data distribution; 2) data distributions with $1/δ$-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.

LGJun 13, 2022
Convergence for score-based generative modeling with polynomial complexity

Holden Lee, Jianfeng Lu, Yixin Tan

Score-based generative modeling (SGM) is a highly successful approach for learning a probability distribution from data and generating further samples. We prove the first polynomial convergence guarantees for the core mechanic behind SGM: drawing samples from a probability density $p$ given a score estimate (an estimate of $\nabla \ln p$) that is accurate in $L^2(p)$. Compared to previous works, we do not incur error that grows exponentially in time or that suffers from a curse of dimensionality. Our guarantee works for any smooth distribution and depends polynomially on its log-Sobolev constant. Using our guarantee, we give a theoretical analysis of score-based generative modeling, which transforms white-noise input into samples from a learned data distribution given score estimates at different noise scales. Our analysis gives theoretical grounding to the observation that an annealed procedure is required in practice to generate good samples, as our proof depends essentially on using annealing to obtain a warm start at each step. Moreover, we show that a predictor-corrector algorithm gives better convergence than using either portion alone.

LGSep 26, 2022
Convergence of score-based generative modeling for general data distributions

Holden Lee, Jianfeng Lu, Yixin Tan

Score-based generative modeling (SGM) has grown to be a hugely successful method for learning to generate samples from complex data distributions such as that of images and audio. It is based on evolving an SDE that transforms white noise into a sample from the learned distribution, using estimates of the score function, or gradient log-pdf. Previous convergence analyses for these methods have suffered either from strong assumptions on the data distribution or exponential dependencies, and hence fail to give efficient guarantees for the multimodal and non-smooth distributions that arise in practice and for which good empirical performance is observed. We consider a popular kind of SGM -- denoising diffusion models -- and give polynomial convergence guarantees for general data distributions, with no assumptions related to functional inequalities or smoothness. Assuming $L^2$-accurate score estimates, we obtain Wasserstein distance guarantees for any distribution of bounded support or sufficiently decaying tails, as well as TV guarantees for distributions with further smoothness assumptions.

LGJun 3, 2023
Provable benefits of score matching

Chirag Pabbaraju, Dhruv Rohatgi, Anish Sevekari et al.

Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ''score'' of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, precise theoretical understanding of the benefits and tradeoffs with maximum likelihood -- both computational and statistical -- are not well understood. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has a comparable statistical efficiency to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.

LGOct 1, 2022
Pitfalls of Gaussians as a noise distribution in NCE

Holden Lee, Chirag Pabbaraju, Anish Sevekari et al.

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution $q$, in a manner that avoids having to calculate a partition function. It is well-known that the choice of $q$ can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for $q$ is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of $q$ will be problematic in practice, suggesting that more complex noise distributions are essential to the success of NCE.

LGSep 19, 2024
What does guidance do? A fine-grained analysis in a simple setting

Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen et al.

The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment.

MLOct 5, 2022
Fisher information lower bounds for sampling

Sinho Chewi, Patrik Gerber, Holden Lee et al.

We prove two lower bounds for the complexity of non-log-concave sampling within the framework of Balasubramanian et al. (2022), who introduced the use of Fisher information (FI) bounds as a notion of approximate first-order stationarity in sampling. Our first lower bound shows that averaged LMC is optimal for the regime of large FI by reducing the problem of finding stationary points in non-convex optimization to sampling. Our second lower bound shows that in the regime of small FI, obtaining a FI of at most $\varepsilon^2$ from the target distribution requires $\text{poly}(1/\varepsilon)$ queries, which is surprising as it rules out the existence of high-accuracy algorithms (e.g., algorithms using Metropolis-Hastings filters) in this context.

LGApr 3, 2023
Improved Bound for Mixing Time of Parallel Tempering

Holden Lee, Zeyu Shen

In the field of sampling algorithms, MCMC (Markov Chain Monte Carlo) methods are widely used when direct sampling is not possible. However, multimodality of target distributions often leads to slow convergence and mixing. One common solution is parallel tempering. Though highly effective in practice, theoretical guarantees on its performance are limited. In this paper, we present a new lower bound for parallel tempering on the spectral gap that has a polynomial dependence on all parameters except $\log L$, where $(L + 1)$ is the number of levels. This improves the best existing bound which depends exponentially on the number of modes. Moreover, we complement our result with a hypothetical upper bound on spectral gap that has an exponential dependence on $\log L$, which shows that, in some sense, our bound is tight.

STMar 17
Convergence Bounds for Sequential Monte Carlo on Multimodal Distributions using Soft Decomposition

Holden Lee, Matheau Santana-Gijzen

We prove bounds on the variance of a function $f$ under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing $N$ particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.

PRMay 5
Potential Hessian Ascent III: Sampling the Sherrington--Kirkpatrick Model at Beta < 1/2

Ewan Davies, Holden Lee, Juspreet Singh Sandhu et al.

We give a polynomial-time algorithm to sample from the Gibbs measure of the Sherrington--Kirkpatrick model with negligible total-variation distance (TVD) error up to inverse temperature $β< 1/2$. Prior work obtained TVD error guarantees only up to $β\approx 0.295$, while results covering the entire replica-symmetric regime $β< 1$ gave guarantees only in Wasserstein distance. Our approach demonstrates that the same potential Hessian ascent previously developed for optimization also functions as a sampling algorithm by implementing algorithmic stochastic localization at high temperature. By estimating the covariance of the tilted Gibbs distribution via Gaussian integration by parts, overlap concentration, and precise cavity estimates, we show that a Hessian-ascent process achieves an $O(1)$ Wasserstein error guarantee for finite-time localization, improving on the previous $o(n)$. A careful comparison of stochastic localization with the Hessian ascent process and a free probability argument controlling the diagonal sub-algebra of the Hessian improves this to $O(1)$ in KL divergence. We then use Jarzynski's equality with rejection sampling, along with a restricted log-Sobolev inequality on the time-$T$ localized distribution, to refine the error to $o(1)$ in TVD up to a constant time $T$ and to complete the sampling with Glauber dynamics.

MLDec 19, 2025
Sampling from multimodal distributions with warm starts: Non-asymptotic bounds for the Reweighted Annealed Leap-Point Sampler

Holden Lee, Matheau Santana-Gijzen

Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. In light of hardness results for sampling -- classical MCMC methods, even with tempering, can suffer from exponential mixing times -- a natural question is how to leverage additional information, such as a warm start point for each mode, to enable faster mixing across modes. To address this, we introduce Reweighted ALPS (Re-ALPS), a modified version of the Annealed Leap-Point Sampler (ALPS) that dispenses with the Gaussian approximation assumption. We prove the first polynomial-time bound that works in a general setting, under a natural assumption that each component contains significant mass relative to the others when tilted towards the corresponding warm start point. Similarly to ALPS, we define distributions tilted towards a mixture centered at the warm start points, and at the coldest level, use teleportation between warm start points to enable efficient mixing across modes. In contrast to ALPS, our method does not require Hessian information at the modes, but instead estimates component partition functions via Monte Carlo. This additional estimation step is crucial in allowing the algorithm to handle target distributions with more complex geometries besides approximate Gaussian. For the proof, we show convergence results for Markov processes when only part of the stationary distribution is well-mixing and estimation for partition functions for individual components of a mixture. We numerically evaluate our algorithm's mixing performance compared to ALPS on a mixture of heavy-tailed distributions.

LGApr 29, 2024
Learning Mixtures of Gaussians Using Diffusion Models

Khashayar Gatmiry, Jonathan Kelner, Holden Lee

We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.

LGNov 14, 2024
Efficiently learning and sampling multimodal distributions with data-based initialization

Frederic Koehler, Holden Lee, Thuy-Duong Vuong

We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.

LGFeb 15, 2024
How Flawed Is ECE? An Analysis via Logit Smoothing

Muthu Chidambaram, Holden Lee, Colin McSwiggen et al.

Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.

CLDec 29, 2023
Principled Gradient-based Markov Chain Monte Carlo for Text Generation

Li Du, Afra Amini, Lucas Torroba Hennigen et al. · mit

Recent papers have demonstrated the possibility of energy-based text generation by adapting gradient-based sampling algorithms, a paradigm of MCMC algorithms that promises fast convergence. However, as we show in this paper, previous attempts on this approach to text generation all fail to sample correctly from the target language model distributions. To address this limitation, we consider the problem of designing text samplers that are faithful, meaning that they have the target text distribution as its limiting distribution. We propose several faithful gradient-based sampling algorithms to sample from the target energy-based text distribution correctly, and study their theoretical properties. Through experiments on various forms of text generation, we demonstrate that faithful samplers are able to generate more fluent text while adhering to the control objectives better.

LGMay 19, 2023
The probability flow ODE is provably fast

Sitan Chen, Sinho Chewi, Holden Lee et al.

We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.

DSFeb 17, 2022
Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods

Frederic Koehler, Holden Lee, Andrej Risteski

We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.

LGJul 7, 2021
Universal Approximation for Log-concave Distributions using Well-conditioned Normalizing Flows

Holden Lee, Chirag Pabbaraju, Anish Sevekari et al.

Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling (Dinh et al, 2014-16) models are a particularly common type of normalizing flows, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers (Huang et al.,2020;Zhang et al.,2020;Koehler et al.,2020) -- who showed reasonably regular distributions can be approximated arbitrarily well using affine couplings -- albeit with networks with a nearly-singular Jacobian. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows? In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures) and Hénon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). Our results also inform the practice of training affine couplings: we approximate a padded version of the input distribution with iid Gaussians -- a strategy which Koehler et al.(2020) empirically observed to result in better-conditioned flows, but had hitherto no theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.

LGNov 19, 2020
Improved rates for prediction and identification of partially observed linear dynamical systems

Holden Lee

Identification of a linear time-invariant dynamical system from partial observations is a fundamental problem in control theory. Particularly challenging are systems exhibiting long-term memory. A natural question is how learn such systems with non-asymptotic statistical rates depending on the inherent dimensionality (order) $d$ of the system, rather than on the possibly much larger memory length. We propose an algorithm that given a single trajectory of length $T$ with gaussian observation noise, learns the system with a near-optimal rate of $\widetilde O\left(\sqrt\frac{d}{T}\right)$ in $\mathcal{H}_2$ error, with only logarithmic, rather than polynomial dependence on memory length. We also give bounds under process noise and improved bounds for learning a realization of the system. Our algorithm is based on multi-scale low-rank approximation: SVD applied to Hankel matrices of geometrically increasing sizes. Our analysis relies on careful application of concentration bounds on the Fourier domain -- we give sharper concentration bounds for sample covariance of correlated inputs and for $\mathcal H_\infty$ norm estimation, which may be of independent interest.

LGSep 30, 2020
Efficient sampling from the Bingham distribution

Rong Ge, Holden Lee, Jianfeng Lu et al.

We give a algorithm for exact sampling from the Bingham distribution $p(x)\propto \exp(x^\top A x)$ on the sphere $\mathcal S^{d-1}$ with expected runtime of $\operatorname{poly}(d, λ_{\max}(A)-λ_{\min}(A))$. The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomials over the sphere. Our algorithm gives exact samples, assuming exact computation of an inverse function of a polynomial. This is in contrast with Markov Chain Monte Carlo algorithms, which are not known to enjoy rapid mixing on this problem, and only give approximate samples. As a direct application, we use this to sample from the posterior distribution of a rank-1 matrix inference problem in polynomial time.

LGFeb 6, 2020
No-Regret Prediction in Marginally Stable Systems

Udaya Ghai, Holden Lee, Karan Singh et al.

We consider the problem of online prediction in a marginally stable linear dynamical system subject to bounded adversarial or (non-isotropic) stochastic perturbations. This poses two challenges. Firstly, the system is in general unidentifiable, so recent and classical results on parameter recovery do not apply. Secondly, because we allow the system to be marginally stable, the state can grow polynomially with time; this causes standard regret bounds in online convex optimization to be vacuous. In spite of these challenges, we show that the online least-squares algorithm achieves sublinear regret (improvable to polylogarithmic in the stochastic setting), with polynomial dependence on the system's parameters. This requires a refined regret analysis, including a structural lemma showing the current state of the system to be a small linear combination of past states, even if the state grows polynomially. By applying our techniques to learning an autoregressive filter, we also achieve logarithmic regret in the partially observed setting under Gaussian noise, with polynomial dependence on the memory of the associated Kalman filter.

DSNov 8, 2019
Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds

Rong Ge, Holden Lee, Jianfeng Lu

Estimating the normalizing constant of an unnormalized probability distribution has important applications in computer science, statistical physics, machine learning, and statistics. In this work, we consider the problem of estimating the normalizing constant $Z=\int_{\mathbb{R}^d} e^{-f(x)}\,\mathrm{d}x$ to within a multiplication factor of $1 \pm \varepsilon$ for a $μ$-strongly convex and $L$-smooth function $f$, given query access to $f(x)$ and $\nabla f(x)$. We give both algorithms and lowerbounds for this problem. Using an annealing algorithm combined with a multilevel Monte Carlo method based on underdamped Langevin dynamics, we show that $\widetilde{\mathcal{O}}\Bigl(\frac{d^{4/3}κ+ d^{7/6}κ^{7/6}}{\varepsilon^2}\Bigr)$ queries to $\nabla f$ are sufficient, where $κ= L / μ$ is the condition number. Moreover, we provide an information theoretic lowerbound, showing that at least $\frac{d^{1-o(1)}}{\varepsilon^{2-o(1)}}$ queries are necessary. This provides a first nontrivial lowerbound for the problem.

LGJun 14, 2019
Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Rohith Kuditipudi, Xiang Wang, Holden Lee et al.

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.

LGMay 23, 2019
Robust guarantees for learning an autoregressive filter

Holden Lee, Cyril Zhang

The optimal predictor for a linear dynamical system (with hidden state and Gaussian noise) takes the form of an autoregressive linear filter, namely the Kalman filter. However, a fundamental problem in reinforcement learning and control theory is to make optimal predictions in an unknown dynamical system. To this end, we take the approach of directly learning an autoregressive filter for time-series prediction under unknown dynamics. Our analysis differs from previous statistical analyses in that we regress not only on the inputs to the dynamical system, but also the outputs, which is essential to dealing with process noise. The main challenge is to estimate the filter under worst case input (in $\mathcal H_\infty$ norm), for which we use an $L^\infty$-based objective rather than ordinary least-squares. For learning an autoregressive model, our algorithm has optimal sample complexity in terms of the rollout length, which does not seem to be attained by naive least-squares.

LGFeb 21, 2019
Online Sampling from Log-Concave Distributions

Holden Lee, Oren Mangoubi, Nisheeth K. Vishnoi

Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $π_t \propto e^{-\sum_{k=0}^tf_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $π_t$ for every epoch $t$ and, under mild assumptions, makes $\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that functions are smooth, their associated distributions have a bounded second moment, and their minimizer drifts in a bounded manner, but do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression, when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\varepsilon$ of $π_t$.

LGNov 29, 2018
Simulated Tempering Langevin Monte Carlo II: An Improved Proof using Soft Markov Chain Decomposition

Rong Ge, Holden Lee, Andrej Risteski

A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric distributions, such as latent-variable generative models. However sampling (even very approximately) can be #P-hard. Classical results going back to Bakry and Émery (1985) on sampling focus on log-concave distributions, and show a natural Markov chain called Langevin diffusion mixes in polynomial time. However, all log-concave distributions are uni-modal, while in practice it is very common for the distribution of interest to have multiple modes. In this case, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for a mixture of (strongly) log-concave distributions of the same shape. In particular, our technique applies to the canonical multi-modal distribution: a mixture of gaussians (of equal variance). Our algorithm efficiently samples from these distributions given only access to the gradient of the log-pdf. For the analysis, we introduce novel techniques for proving spectral gaps based on decomposing the action of the generator of the diffusion. Previous approaches rely on decomposing the state space as a partition of sets, while our approach can be thought of as decomposing the stationary measure as a mixture of distributions (a "soft partition"). Additional materials for the paper can be found at http://holdenlee.github.io/Simulated%20tempering%20Langevin%20Monte%20Carlo.html. The proof and results have been improved and generalized from the precursor at arXiv:1710.02736.

LGFeb 12, 2018
Spectral Filtering for General Linear Dynamical Systems

Elad Hazan, Holden Lee, Karan Singh et al.

We give a polynomial-time algorithm for learning latent-state linear dynamical systems without system identification, and without assumptions on the spectral radius of the system's transition matrix. The algorithm extends the recently introduced technique of spectral filtering, previously applied only to systems with a symmetric transition matrix, using a novel convex relaxation to allow for the efficient identification of phases.

LGOct 7, 2017
Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo

Rong Ge, Holden Lee, Andrej Risteski

A key task in Bayesian statistics is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). However, without any assumptions, sampling (even approximately) can be #P-hard, and few works have provided "beyond worst-case" guarantees for such settings. For log-concave distributions, classical results going back to Bakry and Émery (1985) show that natural continuous-time Markov chains called Langevin diffusions mix in polynomial time. The most salient feature of log-concavity violated in practice is uni-modality: commonly, the distributions we wish to sample from are multi-modal. In the presence of multiple deep and well-separated modes, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for the canonical multi-modal distribution: a mixture of gaussians (of equal variance). The algorithm based on our Markov chain provably samples from distributions that are close to mixtures of gaussians, given access to the gradient of the log-pdf. For the analysis, we use a spectral decomposition theorem for graphs (Gharan and Trevisan, 2014) and a Markov chain decomposition technique (Madras and Randall, 2002).

LGFeb 22, 2017
On the ability of neural nets to express distributions

Holden Lee, Rong Ge, Tengyu Ma et al.

Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution -- also theoretically not understood -- concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks. We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with $n$ hidden layers. A key ingredient is Barron's Theorem \cite{Barron1993}, which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of $n$ functions which satisfy certain Fourier conditions ("Barron functions") can be approximated by a $n+1$-layer neural network. For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance -- a natural metric on probability distributions -- by a neural network applied to a fixed base distribution (e.g., multivariate gaussian). Building up recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone.