Tengyuan Liang

ST
h-index: 23
Papers: 35
Citations: 1,996
Novelty: 56%
AI Score: 46

35 Papers

ST Apr 9, 2022
High-dimensional Asymptotics of Langevin Dynamics in Spiked Matrix Models

Tengyuan Liang, Subhabrata Sen, Pragya Sur

We study Langevin dynamics for recovering the planted signal in the spiked matrix model. We provide a "path-wise" characterization of the overlap between the output of the Langevin algorithm and the planted signal. This overlap is characterized in terms of a self-consistent system of integro-differential equations, usually referred to as the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equations in the spin glass literature. As a second contribution, we derive an explicit formula for the limiting overlap in terms of the signal-to-noise ratio and the injected noise in the diffusion. This uncovers a sharp phase transition -- in one regime, the limiting overlap is strictly positive, while in the other, the injected noise overcomes the signal, and the limiting overlap is zero.

ML Dec 5, 2022
Blessings and Curses of Covariate Shifts: Adversarial Learning Dynamics, Directional Convergence, and Equilibria

Tengyuan Liang

Covariate distribution shifts and adversarial perturbations present robustness challenges to the conventional statistical learning framework: mild shifts in the test covariate distribution can significantly affect the performance of the statistical model learned based on the training distribution. The model performance typically deteriorates when extrapolation happens: namely, covariates shift to a region where the training distribution is scarce, and naturally, the learned model has little information. For robustness and regularization considerations, adversarial perturbation techniques are proposed as a remedy; however, a careful study is needed of which extrapolation region an adversarial covariate shift will focus on, given a learned model. This paper precisely characterizes the extrapolation region, examining both regression and classification in an infinite-dimensional setting. We study the implications of adversarial covariate shifts for subsequent learning of the equilibrium -- the Bayes optimal model -- in a sequential game framework. We exploit the dynamics of the adversarial learning game and reveal the curious effects of the covariate shift on equilibrium learning and experimental design. In particular, we establish two directional convergence results that exhibit distinctive phenomena: (1) a blessing in regression, the adversarial covariate shifts at an exponential rate to an optimal experimental design for rapid subsequent learning; (2) a curse in classification, the adversarial covariate shifts at a subquadratic rate to the hardest experimental design trapping subsequent learning.

ST Dec 10, 2025
Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores

Tengyuan Liang

We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + σZ$, with $Z$ being standard Gaussian independent of $X$ and $σ>0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $σ^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$. We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.

ML Nov 12, 2025
Distributional Shrinkage I: Universal Denoisers in Multi-Dimensions

Tengyuan Liang

We revisit the problem of denoising from noisy measurements where only the noise level is known, not the noise distribution. In multi-dimensions, independent noise $Z$ corrupts the signal $X$, resulting in the noisy measurement $Y = X + σZ$, where $σ\in (0, 1)$ is a known noise level. Our goal is to recover the underlying signal distribution $P_X$ from denoising $P_Y$. We propose and analyze universal denoisers that are agnostic to a wide range of signal and noise distributions. Our distributional denoisers offer order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie's formula, if the focus is on the entire distribution $P_X$ rather than on individual realizations of $X$. Our denoisers shrink $P_Y$ toward $P_X$ optimally, achieving $O(σ^4)$ and $O(σ^6)$ accuracy in matching generalized moments and density functions. Inspired by optimal transport theory, the proposed denoisers are optimal in approximating the Monge-Ampère equation with higher-order accuracy, and can be implemented efficiently via score matching. Let $q$ represent the density of $P_Y$; for optimal distributional denoising, we recommend replacing the Bayes-optimal denoiser, \[ \mathbf{T}^*(y) = y + σ^2 \nabla \log q(y), \] with denoisers exhibiting less aggressive distributional shrinkage, \[ \mathbf{T}_1(y) = y + \frac{σ^2}{2} \nabla \log q(y), \] \[ \mathbf{T}_2(y) = y + \frac{σ^2}{2} \nabla \log q(y) - \frac{σ^4}{8} \nabla \left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right) . \]
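As a sanity check on the three formulas above, consider a one-dimensional Gaussian signal. This is an assumption made here only because the score is then available in closed form: with $X \sim N(0, s^2)$, the density $q$ is that of $N(0, s^2 + σ^2)$ and $\nabla \log q(y) = -y / (s^2 + σ^2)$. All three denoisers reduce to linear maps $y \mapsto c\,y$, so the variance of the denoised distribution, $c^2 (s^2 + σ^2)$, can be compared directly with the target variance $s^2$. The function name and the values $s = 1$, $σ = 0.5$ are illustrative choices, not taken from the paper.

```python
def gaussian_denoiser_variances(s: float, sigma: float) -> dict:
    """Variance of T(Y) for T*, T_1, T_2 when X ~ N(0, s^2), Y = X + sigma * Z.

    Assumption (not from the abstract): a 1-D Gaussian signal, so that
    q is the N(0, v) density with v = s^2 + sigma^2 and the score is -y / v.
    """
    v = s**2 + sigma**2
    a = sigma**2 / v  # sigma^2 times the magnitude of the score's slope

    # In this Gaussian case each denoiser is linear: T(y) = c * y.
    coeffs = {
        "T*": 1.0 - a,        # y + sigma^2 * score
        "T1": 1.0 - a / 2.0,  # y + (sigma^2 / 2) * score
        # grad(0.5 * score^2 + score') = y / v^2, so the extra term is -(a^2 / 8) * y
        "T2": 1.0 - a / 2.0 - a**2 / 8.0,
    }
    # Var(c * Y) = c^2 * v; the target is Var(X) = s^2.
    return {name: c**2 * v for name, c in coeffs.items()}


variances = gaussian_denoiser_variances(s=1.0, sigma=0.5)
errors = {name: abs(var - 1.0**2) for name, var in variances.items()}
print(errors)
```

For these values the variance errors are roughly 0.2 for $\mathbf{T}^*$, 0.0125 for $\mathbf{T}_1$, and 0.0013 for $\mathbf{T}_2$. In this Gaussian case the exact transport map is $y \mapsto y\sqrt{1 - a}$ with $a = σ^2/(s^2 + σ^2)$, and the coefficients of $\mathbf{T}_1$ and $\mathbf{T}_2$ are the first- and second-order Taylor truncations of $\sqrt{1 - a}$.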

ML Nov 3, 2024
Denoising Diffusions with Optimal Transport: Localization, Curvature, and Multi-Scale Complexity

Tengyuan Liang, Kulunu Dharmakeerthi, Takuya Koriyama

Adding noise is easy; what about denoising? Diffusion is easy; what about reverting a diffusion? Diffusion-based generative models aim to denoise a Langevin diffusion chain, moving from a log-concave equilibrium measure $ν$, say isotropic Gaussian, back to a complex, possibly non-log-concave initial measure $μ$. The score function performs denoising, going backward in time, predicting the conditional mean of the past location given the current. We show that score denoising is the optimal backward map in transportation cost. What is its localization uncertainty? We show that the curvature function determines this localization uncertainty, measured as the conditional variance of the past location given the current. We study in this paper the effectiveness of the diffuse-then-denoise process: the contraction of the forward diffusion chain, offset by the possible expansion of the backward denoising chain, governs the denoising difficulty. For any initial measure $μ$, we prove that this offset net contraction at time $t$ is characterized by the curvature complexity of a smoothed $μ$ at a specific signal-to-noise ratio (SNR) scale $r(t)$. We discover that the multi-scale curvature complexity collectively determines the difficulty of the denoising chain. Our multi-scale complexity quantifies a fine-grained notion of average-case curvature instead of the worst-case. Curiously, it depends on an integrated tail function, measuring the relative mass of locations with positive curvature versus those with negative curvature; denoising at a specific SNR scale is easy if such an integrated tail is light. We conclude with several non-log-concave examples to demonstrate how the multi-scale complexity probes the bottleneck SNR for the diffuse-then-denoise process.

ML Apr 12, 2025
No-Regret Generative Modeling via Parabolic Monge-Ampère PDE

Nabarun Deb, Tengyuan Liang

We introduce a novel generative modeling framework based on a discretized parabolic Monge-Ampère PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror gradient descent step. We establish theoretical guarantees for generative modeling through the lens of no-regret analysis, demonstrating that the iterates converge to the optimal Brenier map under a variety of step-size schedules. As a technical contribution, we derive a new Evolution Variational Inequality tailored to the parabolic Monge-Ampère PDE, connecting geometry, transportation cost, and regret. Our framework accommodates non-log-concave target distributions, constructs an optimal sampling process via the Brenier map, and integrates favorable learning techniques from generative adversarial networks and score-based diffusion models. As direct applications, we illustrate how our theory paves new pathways for generative modeling and variational inference.

LG Jun 22, 2024
Learning When the Concept Shifts: Confounding, Invariance, and Dimension Reduction

Kulunu Dharmakeerthi, YoonHaeng Hur, Tengyuan Liang

Practitioners often deploy a learned prediction model in a new environment where the joint distribution of covariate and response has shifted. In observational data, the distribution shift is often driven by unobserved confounding factors lurking in the environment, with the underlying mechanism unknown. Confounding can obfuscate the definition of the best prediction model (concept shift) and shift covariates to domains yet unseen (covariate shift). Therefore, a model maximizing prediction accuracy in the source environment could suffer a significant accuracy drop in the target environment. This motivates us to study the domain adaptation problem with observational data: given labeled covariate and response pairs from a source environment, and unlabeled covariates from a target environment, how can one predict the missing target response reliably? We root the adaptation problem in a linear structural causal model to address endogeneity and unobserved confounding. We study the necessity and benefit of leveraging exogenous, invariant covariate representations to cure concept shifts and improve target prediction. This further motivates a new representation learning method for adaptation that optimizes for a lower-dimensional linear subspace and, subsequently, a prediction model confined to that subspace. The procedure operates on a non-convex objective, one that naturally interpolates between predictability and stability/invariance, constrained on the Stiefel manifold. We study the optimization landscape and prove that, when the regularization is sufficient, nearly all local optima align with an invariant linear subspace resilient to both concept and covariate shift. In terms of predictability, we show a model that uses the learned lower-dimensional subspace can incur a nearly ideal gap between target and source risk. Three real-world data sets are investigated to validate our method and theory.

LG Feb 9, 2022
Online Learning to Transport via the Minimal Selection Principle

Wenxuan Guo, YoonHaeng Hur, Tengyuan Liang et al.

Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.

ME Sep 28, 2021
Reversible Gromov-Monge Sampler for Simulation-Based Inference

YoonHaeng Hur, Wenxuan Guo, Tengyuan Liang

This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d. samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\mathcal{X}, μ, c_{\mathcal{X}})$ and $(\mathcal{Y}, ν, c_{\mathcal{Y}})$ from empirical data sets, with estimated maps that approximately push forward one measure $μ$ to the other $ν$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.

ML Mar 31, 2021
Universal Prediction Band via Semi-Definite Programming

Tengyuan Liang

We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.

ML Jan 28, 2021
Interpolating Classifiers Make Few Mistakes

Tengyuan Liang, Benjamin Recht

This paper provides elementary analyses of the regret and generalization of minimum-norm interpolating classifiers (MNIC). The MNIC is the function of smallest Reproducing Kernel Hilbert Space norm that perfectly interpolates a label pattern on a finite data set. We derive a mistake bound for MNIC and a regularized variant that holds for all data sets. This bound follows from elementary properties of matrix inverses. Under the assumption that the data is independently and identically distributed, the mistake bound implies that MNIC generalizes at a rate proportional to the norm of the interpolating solution and inversely proportional to the number of data points. This rate matches similar rates derived for margin classifiers and perceptrons. We derive several plausible generative models where the norm of the interpolating classifier is bounded or grows at a rate sublinear in $n$. We also show that as long as the population class conditional distributions are sufficiently separable in total variation, then MNIC generalizes with a fast rate.

EM Oct 28, 2020
Deep Learning for Individual Heterogeneity

Max H. Farrell, Tengyuan Liang, Sanjog Misra

This paper integrates deep neural networks (DNNs) into structural economic models to increase flexibility and capture rich heterogeneity while preserving interpretability. Economic structure and machine learning are complements in empirical modeling, not substitutes: DNNs provide the capacity to learn complex, non-linear heterogeneity patterns, while the structural model ensures the estimates remain interpretable and suitable for decision making and policy analysis. We start with a standard parametric structural model and then enrich its parameters into fully flexible functions of observables, which are estimated using a particular DNN architecture whose structure reflects the economic model. We illustrate our framework by studying demand estimation in consumer choice. We show that by enriching a standard demand model we can capture rich heterogeneity, and further, exploit this heterogeneity to create a personalized pricing strategy. This type of optimization is not possible without economic structure, but cannot be heterogeneous without machine learning. Finally, we provide theoretical justification of each step in our proposed methodology. We first establish non-asymptotic bounds and convergence rates of our structural deep learning approach. Next, a novel and quite general influence function calculation allows for feasible inference via double machine learning in a wide variety of contexts. These results may be of interest in many other contexts, as they generalize prior work.

ML Apr 9, 2020
Mehler's Formula, Branching Process, and Compositional Kernels of Deep Neural Networks

Tengyuan Liang, Hai Tran-Bach

We utilize a connection between compositional kernels and branching processes via Mehler's formula to study deep neural networks. This new probabilistic insight provides us a novel perspective on the mathematical role of activation functions in compositional neural networks. We study the unscaled and rescaled limits of the compositional kernels and explore the different phases of the limiting behavior, as the compositional depth increases. We investigate the memorization capacity of the compositional kernels and neural networks by characterizing the interplay among compositional depth, sample size, dimensionality, and non-linearity of the activation. Explicit formulas on the eigenvalues of the compositional kernel are provided, which quantify the complexity of the corresponding reproducing kernel Hilbert space. On the methodological front, we propose a new random features algorithm, which compresses the compositional layers by devising a new activation function.
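To make the idea of a compositional kernel and its large-depth limit concrete, the following minimal sketch iterates a one-layer correlation map. It assumes a normalized ReLU activation, for which the map from input correlation $ρ$ to output correlation has the well-known closed form $(\sqrt{1-ρ^2} + (π - \arccos ρ)\,ρ)/π$. The choice of activation, the function names, and the depth are assumptions made for illustration; the paper's analysis covers general activations via Mehler's formula.

```python
import math


def relu_dual_kernel(rho: float) -> float:
    """Correlation after one normalized ReLU layer, given input correlation rho.

    Assumption: this closed form is for the ReLU activation scaled to
    preserve unit variance; it is used here only as a concrete example.
    """
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi


def compose(rho: float, depth: int) -> list:
    """Apply the one-layer map `depth` times and return every intermediate value."""
    values = [rho]
    for _ in range(depth):
        values.append(relu_dual_kernel(values[-1]))
    return values


# Start from orthogonal inputs (rho = 0) and track the correlation with depth.
correlations = compose(rho=0.0, depth=50)
print(correlations[1], correlations[10], correlations[50])
```

Under this assumption the correlation increases with every layer and drifts toward 1, so in the unscaled limit the composed kernel loses the ability to distinguish inputs. This is one simple instance of the depth-dependent limiting behavior the abstract refers to.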

ST Feb 5, 2020
A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers

Tengyuan Liang, Pragya Sur

This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking statistical and computational perspectives. We consider a high-dimensional setting where the number of features (weak learners) $p$ scales with the sample size $n$, in an overparametrized regime. Under a class of statistical models, we provide an exact analysis of the generalization error of boosting when the algorithm interpolates the training data and maximizes the empirical $\ell_1$-margin. Further, we explicitly pin down the relation between the boosting test error and the optimal Bayes error, as well as the proportion of active features at interpolation (with zero initialization). In turn, these precise characterizations answer certain questions raised by Breiman (1999) and Schapire et al. (1998) surrounding boosting, under assumed data generating processes. At the heart of our theory lies an in-depth study of the maximum-$\ell_1$-margin, which can be accurately described by a new system of non-linear equations; to analyze this margin, we rely on Gaussian comparison techniques and develop a novel uniform deviation argument. Our statistical and computational arguments can handle (1) any finite-rank spiked covariance model for the feature distribution and (2) variants of boosting corresponding to general $\ell_q$-geometry, $q \in [1, 2]$. As a final component, via the Lindeberg principle, we establish a universality result showcasing that the scaled $\ell_1$-margin (asymptotically) remains the same, whether the covariates used for boosting arise from a non-linear random feature model or an appropriately linearized model with matching moments.

ST Nov 2, 2019
Estimating Certain Integral Probability Metric (IPM) is as Hard as Estimating under the IPM

Tengyuan Liang

We study the minimax optimal rates for estimating a range of Integral Probability Metrics (IPMs) between two unknown probability measures, based on $n$ independent samples from them. Curiously, we show that estimating the IPM itself between probability measures is not significantly easier than estimating the probability measures under the IPM. We prove that the minimax optimal rates for these two problems are multiplicatively equivalent, up to a $\log \log (n)/\log (n)$ factor.

ST Aug 27, 2019
On the Minimax Optimality of Estimating the Wasserstein Metric

Tengyuan Liang

We study the minimax optimal rate for estimating the Wasserstein-$1$ metric between two unknown probability measures based on $n$ i.i.d. empirical samples from them. We show that estimating the Wasserstein metric itself between probability measures is not significantly easier than estimating the probability measures under the Wasserstein metric. We prove that the minimax optimal rates for these two problems are multiplicatively equivalent, up to a $\log \log (n)/\log (n)$ factor.

ST Aug 27, 2019
On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels

Tengyuan Liang, Alexander Rakhlin, Xiyu Zhai

We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^α$, $α\in(0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

ML Jan 21, 2019
Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Xialiang Dou, Tengyuan Liang

Consider the problem: given the data pair $(\mathbf{x}, \mathbf{y})$ drawn from a population with $f_*(x) = \mathbf{E}[\mathbf{y} | \mathbf{x} = x]$, specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does $f_t$, the function computed by the neural network at time $t$, relate to $f_*$, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for $f_*$ lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.

ST Nov 7, 2018
How Well Generative Adversarial Networks Learn Distributions

Tengyuan Liang

This paper studies the rates of convergence for learning distributions implicitly with the adversarial framework and Generative Adversarial Networks (GANs), which subsume Wasserstein, Sobolev, MMD GAN, and Generalized/Simulated Method of Moments (GMM/SMM) as special cases. We study a wide range of parametric and nonparametric target distributions under a host of objective evaluation metrics. We investigate how to obtain valid statistical guarantees for GANs through the lens of regularization. On the nonparametric end, we derive the optimal minimax rates for distribution estimation under the adversarial framework. On the parametric end, we establish a theory for general neural network classes (including deep leaky ReLU networks) that characterizes the interplay on the choice of generator and discriminator pair. We discover and isolate a new notion of regularization, called the generator-discriminator-pair regularization, that sheds light on the advantage of GANs compared to classical parametric and nonparametric approaches for explicit distribution estimation. We develop novel oracle inequalities as the main technical tools for analyzing GANs, which are of independent interest.

EM Sep 26, 2018
Deep Neural Networks for Estimation and Inference

Max H. Farrell, Tengyuan Liang, Sanjog Misra

We study deep neural networks and their use in semiparametric inference. We establish novel rates of convergence for deep feedforward neural nets. Our new rates are sufficiently fast (in some cases minimax optimal) to allow us to establish valid second-step inference after first-step estimation with deep learning, a result also new to the literature. Our estimation rates and semiparametric inference results handle the current standard architecture: fully connected feedforward neural networks (multi-layer perceptrons), with the now-common rectified linear unit activation function and a depth explicitly diverging with the sample size. We discuss other architectures as well, including fixed-width, very deep networks. We establish nonasymptotic bounds for these deep nets for a general class of nonparametric regression-type loss functions, which includes as special cases least squares, logistic regression, and other generalized linear models. We then apply our theory to develop semiparametric inference, focusing on causal parameters for concreteness, such as treatment effects, expected welfare, and decomposition effects. Inference in many other semiparametric contexts can be readily obtained. We demonstrate the effectiveness of deep learning with a Monte Carlo analysis and an empirical application to direct mail marketing.

ST Aug 1, 2018
Just Interpolate: Kernel "Ridgeless" Regression Can Generalize

Tengyuan Liang, Alexander Rakhlin

In the absence of explicit regularization, Kernel "Ridgeless" Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.
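The estimator described above can be sketched in a few lines. The minimum-norm interpolant is $f(x) = \sum_i α_i k(x, x_i)$ with $α$ solving $Kα = y$, with no ridge term added to the Gram matrix $K$. The Gaussian kernel, the bandwidth, the tiny data set, and the hand-written linear solver below are all assumptions chosen to keep the example dependency-free; the abstract itself does not fix a kernel or an implementation.

```python
import math


def rbf_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * bandwidth**2))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ridgeless(points, targets, bandwidth=1.0):
    """Return the minimum-RKHS-norm interpolant f(x) = sum_i alpha_i k(x, x_i)."""
    gram = [[rbf_kernel(p, q, bandwidth) for q in points] for p in points]
    alpha = solve_linear_system(gram, targets)  # no ridge term: K alpha = y

    def predict(x):
        return sum(a * rbf_kernel(x, p, bandwidth) for a, p in zip(alpha, points))

    return predict


# Toy data (hypothetical): four distinct 1-D inputs with arbitrary labels.
train_x = [[0.0], [1.0], [2.0], [3.5]]
train_y = [0.5, -1.0, 2.0, 0.0]
f = fit_ridgeless(train_x, train_y)
max_train_error = max(abs(f(x) - y) for x, y in zip(train_x, train_y))
print(max_train_error)  # numerically zero: the fit interpolates the training data
```

Because the Gaussian kernel matrix on distinct points is positive definite, the system has a unique solution and the training error is zero up to rounding. How well such a fit generalizes is the question the paper addresses, and the sketch does not attempt to show it.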

LG Feb 18, 2018
Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

Belinda Tzen, Tengyuan Liang, Maxim Raginsky

We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003). For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short recurrence time; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long escape time. We call this phenomenon empirical metastability. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However, unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.
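The second event described above (enter the $\varepsilon$-neighborhood quickly, then stay for a long time) can be illustrated with a minimal sketch of the discrete-time Langevin update, $θ_{k+1} = θ_k - η \nabla F(θ_k) + \sqrt{2η/β}\, ξ_k$ with standard Gaussian $ξ_k$. The one-dimensional double-well objective, the step size, the inverse temperature $β$, the neighborhood radius, and the seed are all assumptions picked for illustration; this is a toy stand-in, not the empirical risk setting analyzed in the paper.

```python
import math
import random


def grad_double_well(theta: float) -> float:
    """Gradient of the toy objective F(theta) = (theta^2 - 1)^2 / 4 (minima at +1 and -1)."""
    return theta**3 - theta


def langevin_path(theta0, step_size, inverse_temp, num_steps, seed=0):
    """Discrete-time Langevin iterates: gradient step plus scaled Gaussian noise."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size / inverse_temp)
    theta, path = theta0, [theta0]
    for _ in range(num_steps):
        theta = theta - step_size * grad_double_well(theta) + noise_scale * rng.gauss(0.0, 1.0)
        path.append(theta)
    return path


# Hypothetical settings: low temperature, so the barrier at 0 is hard to cross.
path = langevin_path(theta0=2.0, step_size=0.01, inverse_temp=100.0, num_steps=5000)

eps = 0.5  # radius of the neighborhood around the local minimum at theta = 1
first_hit = next(k for k, th in enumerate(path) if abs(th - 1.0) < eps)
stayed = all(abs(th - 1.0) < eps for th in path[first_hit:])
print(first_hit, stayed)
```

With these settings the iterate reaches the neighborhood of $θ = 1$ within a few dozen steps and remains there for the rest of the run. The barrier height times $β$ is 25, so a transition to the other well would be expected only on a far longer horizon.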

ML Feb 16, 2018
Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks

Tengyuan Liang, James Stokes

Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for the unstable equilibria, exponential convergence can be proved thanks to the interaction term, for four modified dynamics proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO), Implicit Updates (IU) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides detailed characterization on the choice of learning rate. As a by-product, we present a new analysis for OMD proposed in Daskalakis, Ilyas, Syrgkanis, and Zeng [2017] with improved rates.
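The contrast between plain simultaneous gradient updates and an optimistic variant can be seen on the simplest two-player game with a pure interaction term, $f(x, y) = xy$, where $x$ minimizes and $y$ maximizes. This bilinear toy game, the step size, and the number of steps are assumptions made for illustration, and the optimistic update below is the standard unconstrained form (twice the current gradient minus the previous one); none of this is taken from the paper's experiments.

```python
import math


def simulate(eta: float = 0.2, steps: int = 500, optimistic: bool = False) -> float:
    """Run gradient dynamics on the bilinear game f(x, y) = x * y.

    x minimizes f and y maximizes f; the unique equilibrium is (0, 0).
    Returns the final Euclidean distance to that equilibrium.
    """
    x, y = 1.0, 1.0
    prev_gx, prev_gy = y, x  # treat the initial point as its own predecessor
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        if optimistic:
            # Optimistic step: 2 * current gradient - previous gradient.
            step_x, step_y = 2.0 * gx - prev_gx, 2.0 * gy - prev_gy
        else:
            step_x, step_y = gx, gy
        x, y = x - eta * step_x, y + eta * step_y  # descent in x, ascent in y
        prev_gx, prev_gy = gx, gy
    return math.hypot(x, y)


dist_simultaneous = simulate(optimistic=False)
dist_optimistic = simulate(optimistic=True)
print(dist_simultaneous, dist_optimistic)
```

On this game the simultaneous updates spiral away from the equilibrium, since each step multiplies the distance by $\sqrt{1 + η^2}$, while the optimistic updates contract toward it at a geometric rate.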

ML Dec 21, 2017
How Well Can Generative Adversarial Networks Learn Densities: A Nonparametric View

Tengyuan Liang

We study in this paper the rate of convergence for learning densities under the Generative Adversarial Networks (GAN) framework, borrowing insights from nonparametric statistics. We introduce an improved GAN estimator that achieves a faster rate, through simultaneously leveraging the level of smoothness in the target density and the evaluation metric, which in theory remedies the mode collapse problem reported in the literature. A minimax lower bound is constructed to show that when the dimension is large, the exponent in the rate for the new GAN estimator is near optimal. One can view our results as answering in a quantitative way how well GAN learns a wide range of densities with different smoothness properties, under a hierarchy of evaluation metrics. As a byproduct, we also obtain improved generalization bounds for GAN with deeper ReLU discriminator network.

MLDec 20, 2017
Statistical Inference for the Population Landscape via Moment Adjusted Stochastic Gradients

Tengyuan Liang, Weijie Su

Modern statistical inference tasks often require iterative optimization methods to compute the solution. Convergence analysis from an optimization viewpoint only informs us how well the solution is approximated numerically but overlooks the sampling nature of the data. In contrast, recognizing the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the solution obtained using iterative optimization methods. This paper makes progress along this direction by introducing the moment-adjusted stochastic gradient descents, a new stochastic optimization method for statistical inference. We establish non-asymptotic theory that characterizes the statistical distribution for certain iterative methods with optimization guarantees. On the statistical front, the theory allows for model mis-specification, with very mild conditions on the data. For optimization, the theory is flexible for both convex and non-convex cases. Remarkably, the moment-adjusting idea motivated from "error standardization" in statistics achieves a similar effect as acceleration in first-order optimization methods used to fit generalized linear models. We also demonstrate this acceleration effect in the non-convex setting through numerical experiments.

LGNov 5, 2017
Fisher-Rao Metric, Geometry, and Complexity of Neural Networks

Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin et al.

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity --- the Fisher-Rao norm --- that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

STSep 12, 2017
Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

LGJun 14, 2017
Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP

Satyen Kale, Zohar Karnin, Tengyuan Liang et al.

Online sparse linear regression is an online problem where an algorithm repeatedly chooses a subset of coordinates to observe in an adversarially chosen feature vector, makes a real-valued prediction, receives the true label, and incurs the squared loss. The goal is to design an online learning algorithm with sublinear regret to the best sparse linear predictor in hindsight. Without any assumptions, this problem is known to be computationally intractable. In this paper, we make the assumption that the data matrix satisfies the restricted isometry property, and show that this assumption leads to computationally efficient algorithms with sublinear regret for two variants of the problem. In the first variant, the true label is generated according to a sparse linear model with additive Gaussian noise. In the second, the true label is chosen adversarially.

STApr 21, 2016
On Detection and Structural Reconstruction of Small-World Random Networks

T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

In this paper, we study detection and fast reconstruction of the celebrated Watts-Strogatz (WS) small-world random graph model \citep{watts1998collective}, which aims to describe real-world complex networks that exhibit both high clustering and short average path length. The WS model with neighborhood size $k$ and rewiring probability $β$ can be viewed as a continuous interpolation between a deterministic ring lattice graph and the Erdős-Rényi random graph. We study both the computational and statistical aspects of detecting the deterministic ring lattice structure (or local geographical links, strong ties) in the presence of random connections (or long range links, weak ties), and of recovering it. The phase diagram in terms of $(k,β)$ is partitioned into several regions according to the difficulty of the problem. We propose distinct methods for the various regions.
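To make the interpolation between a ring lattice and a random graph concrete, the following sketch generates a graph with the classical rewiring construction. It is a simplified illustration and may differ in detail from the WS variant analyzed in the paper; the function name, the adjacency-set representation, and the rule for picking a new endpoint are assumptions made here.

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Ring lattice on n nodes with k nearest neighbours (k even, k < n),
    where each lattice edge is rewired with probability beta."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Deterministic ring lattice: connect each node to k/2 neighbours on each side.
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    # Visit each lattice edge (i, i + d) once and rewire its far endpoint.
    for d in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + d) % n
            if rng.random() < beta and j in adj[i]:
                # Candidate endpoints: no self-loops, no duplicate edges.
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj

if __name__ == "__main__":
    g = watts_strogatz(n=20, k=4, beta=0.3)
    print(sum(len(nbrs) for nbrs in g.values()) // 2, "edges")
```

With beta = 0 the output is exactly the ring lattice. Larger beta replaces more local links with long-range ones while keeping the number of edges fixed at n * k / 2.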

STMar 22, 2016
Inference via Message Passing on Partially Labeled Stochastic Block Models

T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

We study the community detection and recovery problem in partially-labeled stochastic block models (SBM). We develop a fast linearized message-passing algorithm to reconstruct labels for SBM (with $n$ nodes, $k$ blocks, $p,q$ intra and inter block connectivity) when $δ$ proportion of node labels are revealed. The signal-to-noise ratio ${\sf SNR}(n,k,p,q,δ)$ is shown to characterize the fundamental limitations of inference via local algorithms. On the one hand, when ${\sf SNR}>1$, the linearized message-passing algorithm provides the statistical inference guarantee with misclassification rate at most $\exp(-({\sf SNR}-1)/2)$, thus interpolating smoothly between strong and weak consistency. This exponential dependence improves upon the known error rate $({\sf SNR}-1)^{-1}$ in the literature on weak recovery. On the other hand, when ${\sf SNR}<1$ (for $k=2$) and ${\sf SNR}<1/4$ (for general growing $k$), we prove that local algorithms suffer an error rate at least $\frac{1}{2} - \sqrt{δ\cdot {\sf SNR}}$, which is only slightly better than random guessing for small $δ$.
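As a rough numerical reading of the two upper bounds quoted above, write $u = {\sf SNR}-1$. The value $u=4$ is an arbitrary illustration chosen here and does not come from the paper.

```latex
\[
  \exp\!\left(-\tfrac{u}{2}\right)\Big|_{u=4} = e^{-2} \approx 0.135
  \qquad\text{versus}\qquad
  \frac{1}{u}\Big|_{u=4} = 0.25 .
\]
\[
  \text{More generally, } u\,e^{-u/2} \le \tfrac{2}{e} \approx 0.736 < 1
  \ \text{ for all } u > 0,
  \quad\text{so}\quad e^{-u/2} < \tfrac{1}{u}.
\]
```

The maximum of $u\,e^{-u/2}$ is attained at $u=2$, so the exponential bound is the smaller of the two for every ${\sf SNR}>1$, and the gap widens as the SNR grows.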

MLFeb 21, 2015
Learning with Square Loss: Localization through Offset Rademacher Complexity

Tengyuan Liang, Alexander Rakhlin, Karthik Sridharan

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of \citep{RakSriTsy15} for the bounded case while also providing guarantees without the boundedness assumption.

STFeb 6, 2015
Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix

T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of (one or multiple non-overlapping) signal submatrix (of magnitude $λ$ and size $k_m \times k_n$) contaminated with a noise matrix (of size $m \times n$), we establish two transition thresholds for the signal-to-noise ratio $λ/σ$ in terms of $m$, $n$, $k_m$, and $k_n$. The first threshold, $\sf SNR_c$, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the \textit{hidden clique hypothesis}. We introduce adaptive linear time spectral algorithms that identify the submatrix with high probability when the signal strength is above the threshold $\sf SNR_c$. The second threshold, $\sf SNR_s$, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that $\sf SNR_c$ is always significantly larger than $\sf SNR_s$, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.

NAJan 28, 2015
Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions

Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan et al.

We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an \emph{approximately} log-concave distribution using the Hit-and-Run method, which is shown to have the same $\mathcal{O}^*$ complexity as sampling from log-concave distributions. In addition to extending the analysis for log-concave distributions to approximately log-concave distributions, the implementation of the 1-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm is then based on simulated annealing, which does not rely on first-order conditions and is therefore essentially immune to local minima. We then apply the method to different motivating problems. In the context of zeroth order stochastic convex optimization, the proposed method produces an $ε$-minimizer after $\mathcal{O}^*(n^{7.5}ε^{-2})$ noisy function evaluations by inducing a $\mathcal{O}(ε/n)$-approximately log-concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.

STApr 17, 2014
Geometric Inference for General High-Dimensional Linear Inverse Problems

T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

This paper presents a unified geometric framework for the statistical analysis of a general ill-posed linear inverse model which includes as special cases noisy compressed sensing, sign vector recovery, trace regression, orthogonal matrix estimation, and noisy matrix completion. We propose computationally feasible convex programs for statistical inference including estimation, confidence intervals and hypothesis testing. A theoretical framework is developed to characterize the local estimation rate of convergence and to provide statistical inference guarantees. Our results are built based on the local conic geometry and duality. The difficulty of statistical inference is captured by the geometric characterization of the local tangent cone through the Gaussian width and Sudakov minoration estimate.

LGFeb 11, 2014
On Zeroth-Order Stochastic Convex Optimization via Random Walks

Tengyuan Liang, Hariharan Narayanan, Alexander Rakhlin

We propose a method for zeroth order stochastic convex optimization that attains the suboptimality rate of $\tilde{\mathcal{O}}(n^{7}T^{-1/2})$ after $T$ queries for a convex bounded function $f:{\mathbb R}^n\to{\mathbb R}$. The method is based on a random walk (the \emph{Ball Walk}) on the epigraph of the function. The randomized approach circumvents the problem of gradient estimation, and appears to be less sensitive to noisy function evaluations compared to noiseless zeroth order methods.