Matus Telgarsky

LG
h-index26
41papers
6,022citations
Novelty56%
AI Score45

41 Papers

OCMay 6, 2022
Astral Space: Convex Analysis at Infinity

Miroslav Dudík, Robert E. Schapire, Matus Telgarsky

Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.

LGJun 5, 2023
Representational Strengths and Limitations of Transformers

Clayton Sanford, Daniel Hsu, Matus Telgarsky

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.

LGAug 26, 2024
One-layer transformers fail to solve the induction heads task

Clayton Sanford, Daniel Hsu, Matus Telgarsky

A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.

LGAug 4, 2022
Feature selection with gradient descent on two-layer networks in low-rotation regimes

Matus Telgarsky

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely-well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and can not rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.

LGJun 13, 2023
On Achieving Optimal Adversarial Test Error

Justin D. Li, Matus Telgarsky

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

STDec 31, 2025
Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

Seunghoon Paik, Kangjie Zhou, Matus Telgarsky et al.

We introduce \textit{basic inequalities} for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. While related inequalities appear in the literature, we isolate and highlight a specific form and develop it as a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound translates the number of iterations into an effective regularization coefficient in the loss function. We demonstrate this framework through analyses of training dynamics and prediction risk bounds. In addition to revisiting and refining known results on gradient descent, we provide new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.

LGFeb 25, 2024Code
Spectrum Extraction and Clipping for Implicitly Linear Layers

Ali Ebrahimpour Boroojeny, Matus Telgarsky, Hari Sundaram

We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. We provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. We study the effect of the batch normalization layers when concatenated with convolutional layers and show how our clipping method can be applied to their composition. By comparing the accuracy and performance of our algorithms to the state-of-the-art methods, using various experiments, we show they are more precise and efficient and lead to better generalization and adversarial robustness. We provide the code for using our methods at https://github.com/Ali-E/FastClip.

LGFeb 14, 2024
Transformers, parallel computation, and logarithmic depth

Clayton Sanford, Daniel Hsu, Matus Telgarsky

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.

LGFeb 24, 2024
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky et al. · berkeley

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $η:= Θ( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.

LGFeb 18, 2025
Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

Jingfeng Wu, Peter Bartlett, Matus Telgarsky et al. · berkeley

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.

LGFeb 14, 2022
Stochastic linear optimization never overfits with quadratically-bounded losses on general data

Matus Telgarsky

This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).

LGOct 21, 2021
Actor-critic is implicitly biased towards high entropy optimal policies

Yuzheng Hu, Ziwei Ji, Matus Telgarsky

We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like $ε$-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.

LGJul 1, 2021
Fast Margin Maximization via Dual Acceleration

Ziwei Ji, Nathan Srebro, Matus Telgarsky

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.

LGJun 10, 2021
Early-stopped neural networks are consistent

Ziwei Ji, Justin D. Li, Matus Telgarsky

This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is inconsistent.

LGApr 12, 2021
Generalization bounds via distillation

Daniel Hsu, Ziwei Ji, Matus Telgarsky et al.

This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar and mnist demonstrating similar generalization performance between the original network and its distillation.

LGJun 19, 2020
Gradient descent follows the regularization path for general losses

Ziwei Ji, Miroslav Dudík, Robert E. Schapire et al.

Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an implicit bias. This bias is typically towards a certain regularized solution, and relies upon the details of the learning process, for instance the use of the cross-entropy loss. In this work, we show that for empirical risk minimization over linear predictors with arbitrary convex, strictly decreasing losses, if the risk does not attain its infimum, then the gradient-descent path and the algorithm-independent regularization path converge to the same direction (whenever either converges to a direction). Using this result, we provide a justification for the widely-used exponentially-tailed losses (such as the exponential loss or the logistic loss): while this convergence to a direction for exponentially-tailed losses is necessarily to the maximum-margin direction, other losses such as polynomially-tailed losses may induce convergence to a direction with a poor margin.

LGJun 11, 2020
Directional convergence and alignment in deep learning

Ziwei Ji, Matus Telgarsky

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks -- a broad class of networks allowing for ReLU, max-pooling, linear, and convolutional layers -- and we additionally provide empirical support not just close to the theory (e.g., the AlexNet), but also on non-homogeneous networks (e.g., the DenseNet). If the network further has locally Lipschitz gradients, we show that these gradients also converge in direction, and asymptotically align with the gradient flow path, with consequences on margin maximization, convergence of saliency maps, and a few other settings. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Łojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.

LGOct 15, 2019
Neural tangent kernels, transportation mappings, and universal approximation

Ziwei Ji, Matus Telgarsky, Ruicheng Xian

This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization. Concretely, the paper has two main contributions: a generic scheme to approximate functions with the NTK by sampling from transport mappings between the initial weights and their desired values, and the construction of transport mappings via Fourier transforms. Regarding the first contribution, the proof scheme provides another perspective on how the NTK regime arises from rescaling: redundancy in the weights due to resampling allows individual weights to be scaled down. Regarding the second contribution, the most notable transport mapping asserts that roughly $1 / δ^{10d}$ nodes are sufficient to approximate continuous functions, where $δ$ depends on the continuity properties of the target function. By contrast, nearly the same proof yields a bound of $1 / δ^{2d}$ for shallow ReLU networks; this gap suggests a tantalizing direction for future work, separating shallow ReLU networks and their linearization.

LGSep 26, 2019
Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Ziwei Ji, Matus Telgarsky

Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size $n$, the (inverse) target error $1/ε$, and the (inverse) failure probability $1/δ$. This work shows that $\widetildeΘ(1/ε)$ iterations of gradient descent with $\widetildeΩ(1/ε^2)$ training examples on two-layer ReLU networks of any width exceeding $\mathrm{polylog}(n,1/ε,1/δ)$ suffice to achieve a test misclassification error of $ε$. We also prove that stochastic gradient descent can achieve $ε$ test error with polylogarithmic width and $\widetildeΘ(1/ε)$ samples. The analysis relies upon the separation margin of the limiting kernel, which is guaranteed positive, can distinguish between true labels and random labels, and can give a tight sample-complexity analysis in the infinite-width setting

LGJun 18, 2019
Approximation power of random neural networks

Bolton Bailey, Ziwei Ji, Matus Telgarsky et al.

This paper investigates the approximation power of three types of random neural networks: (a) infinite width networks, with weights following an arbitrary distribution; (b) finite width networks obtained by subsampling the preceding infinite width networks; (c) finite width networks obtained by starting with standard Gaussian initialization, and then adding a vanishingly small correction to the weights. The primary result is a fully quantified bound on the rate of approximation of general general continuous functions: in all three cases, a function $f$ can be approximated with complexity $\|f\|_1 (d/δ)^{\mathcal{O}(d)}$, where $δ$ depends on continuity properties of $f$ and the complexity measure depends on the weight magnitudes and/or cardinalities. Along the way, a variety of ancillary results are developed: an exact construction of Gaussian densities with infinite width networks, an elementary stand-alone proof scheme for approximation via convolutions of radial basis functions, subsampling rates for infinite width networks, and depth separation for corrected networks.

LGJun 11, 2019
Characterizing the implicit bias via a primal-dual analysis

Ziwei Ji, Matus Telgarsky

This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast to prior results, which are often tailored to exponentially-tailed losses. For the exponential loss specifically, with $n$ training examples and $t$ gradient descent steps, our dual analysis further allows us to prove an $O(\ln(n)/\ln(t))$ convergence rate to the $\ell_2$ maximum margin direction, when a constant step size is used. This rate is tight in both $n$ and $t$, which has not been presented by prior work. On the other hand, with a properly chosen but aggressive step size schedule, we prove $O(1/t)$ rates for both $\ell_2$ margin maximization and implicit bias, whereas prior work (including all first-order methods for the general hard-margin linear SVM problem) proved $\widetilde{O}(1/\sqrt{t})$ margin rates, or $O(1/t)$ margin rates to a suboptimal margin, with an implied (slower) bias rate. Our key observations include that gradient descent on the primal variable naturally induces a mirror descent update on the dual variable, and that the dual objective in this setting is smooth enough to give a faster rate.

LGJun 8, 2019
A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

Yucheng Chen, Matus Telgarsky, Chao Zhang et al.

This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport costs). The approach is based on two principles: (a) if the source randomness of the network is a continuous distribution (the "semi-discrete" setting), then the Wasserstein distance is realized by a deterministic optimal transport mapping; (b) given an optimal transport mapping between a generator network and a target distribution, the Wasserstein distance may be decreased via a regression between the generated data and the mapped target points. The procedure here therefore alternates these two steps, forming an optimal transport and regressing against it, gradually adjusting the generator network towards the target distribution. Mathematically, this approach is shown to minimize the Wasserstein distance to both the empirical target distribution, and also its underlying population counterpart. Empirically, good performance is demonstrated on the training and testing sets of the MNIST and Thin-8 data. The paper closes with a discussion of the unsuitability of the Wasserstein distance for certain tasks, as has been identified in prior work [Arora et al., 2017, Huang et al., 2017].

LGOct 26, 2018
Size-Noise Tradeoffs in Generative Networks

Bolton Bailey, Matus Telgarsky

This paper investigates the ability of generative networks to convert their input noise distributions into other distributions. Firstly, we demonstrate a construction that allows ReLU networks to increase the dimensionality of their noise distribution by implementing a "space-filling" function based on iterated tent maps. We show this construction is optimal by analyzing the number of affine pieces in functions computed by multivariate ReLU networks. Secondly, we provide efficient ways (using polylog $(1/ε)$ nodes) for networks to pass between univariate uniform and normal distributions, using a Taylor series approximation and a binary search gadget for computing function inverses. Lastly, we indicate how high dimensional distributions can be efficiently transformed into low dimensional distributions.

LGOct 4, 2018
Gradient descent aligns the layers of deep linear networks

Ziwei Ji, Matus Telgarsky

This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized i-th weight matrix asymptotically equals its rank-1 approximation $u_iv_i^{\top}$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^{\top}u_i|\to1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network --- the product of its weight matrices --- converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.

LGMar 20, 2018
Risk and parameter convergence of logistic regression

Ziwei Ji, Matus Telgarsky

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.

GTNov 6, 2017
Social welfare and profit maximization from revealed preferences

Ziwei Ji, Ruta Mehta, Matus Telgarsky

Consider the seller's problem of finding optimal prices for her $n$ (divisible) goods when faced with a set of $m$ consumers, given that she can only observe their purchased bundles at posted prices, i.e., revealed preferences. We study both social welfare and profit maximization with revealed preferences. Although social welfare maximization is a seemingly non-convex optimization problem in prices, we show that (i) it can be reduced to a dual convex optimization problem in prices, and (ii) the revealed preferences can be interpreted as supergradients of the concave conjugate of valuation, with which subgradients of the dual function can be computed. We thereby obtain a simple subgradient-based algorithm for strongly concave valuations and convex cost, with query complexity $O(m^2/ε^2)$, where $ε$ is the additive difference between the social welfare induced by our algorithm and the optimum social welfare. We also study social welfare maximization under the online setting, specifically the random permutation model, where consumers arrive one-by-one in a random order. For the case where consumer valuations can be arbitrary continuous functions, we propose a price posting mechanism that achieves an expected social welfare up to an additive factor of $O(\sqrt{mn})$ from the maximum social welfare. Finally, for profit maximization (which may be non-convex in simple cases), we give nearly matching upper and lower bounds on the query complexity for separable valuations and cost (i.e., each good can be treated independently).

LGJun 26, 2017
Spectrally-normalized margin bounds for neural networks

Peter Bartlett, Dylan J. Foster, Matus Telgarsky

This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.

LGJun 11, 2017
Neural networks and rational functions

Matus Telgarsky

Neural networks and rational functions efficiently approximate each other. In more detail, it is shown here that for any ReLU network, there exists a rational function of degree $O(\text{polylog}(1/ε))$ which is $ε$-close, and similarly for any rational function there exists a ReLU network of size $O(\text{polylog}(1/ε))$ which is $ε$-close. By contrast, polynomials need degree $Ω(\text{poly}(1/ε))$ to approximate even a single ReLU. When converting a ReLU network to a rational function as above, the hidden constants depend exponentially on the number of layers, which is shown to be tight; in other words, a compositional representation can be beneficial even for rational functions.

LGFeb 13, 2017
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky

Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand and Mitter, 1991). The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks. As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.

DSJul 21, 2016
Greedy bi-criteria approximations for $k$-medians and $k$-means

Daniel Hsu, Matus Telgarsky

This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases clustering cost. In the case of $k$-medians and $k$-means, the key results are as follows. $\bullet$ When the method considers all data points as candidate centers, then selecting $\mathcal{O}(k\log(1/\varepsilon))$ centers achieves cost at most $2+\varepsilon$ times the optimal cost with $k$ centers. $\bullet$ Alternatively, the same guarantees hold if each round samples $\mathcal{O}(k/\varepsilon^5)$ candidate centers proportionally to their cluster cost (as with $\texttt{kmeans++}$, but holding centers fixed). $\bullet$ In the case of $k$-means, considering an augmented set of $n^{\lceil1/\varepsilon\rceil}$ candidate centers gives $1+\varepsilon$ approximation with $\mathcal{O}(k\log(1/\varepsilon))$ centers, the entire algorithm taking $\mathcal{O}(dk\log(1/\varepsilon)n^{1+\lceil1/\varepsilon\rceil})$ time, where $n$ is the number of data points in $\mathbb{R}^d$. $\bullet$ In the case of Euclidean $k$-medians, generating a candidate set via $n^{\mathcal{O}(1/\varepsilon^2)}$ executions of stochastic gradient descent with adaptively determined constraint sets will once again give approximation $1+\varepsilon$ with $\mathcal{O}(k\log(1/\varepsilon))$ centers in $dk\log(1/\varepsilon)n^{\mathcal{O}(1/\varepsilon^2)}$ time. Ancillary results include: guarantees for cluster costs based on powers of metrics; a brief, favorable empirical evaluation against $\texttt{kmeans++}$; data-dependent bounds allowing $1+\varepsilon$ in the first two bullets above, for example with $k$-medians over finite metric spaces.

LGFeb 14, 2016
Benefits of depth in neural networks

Matus Telgarsky

For any positive integer $k$, there exist neural networks with $Θ(k^3)$ layers, $Θ(1)$ nodes per layer, and $Θ(1)$ distinct parameters which can not be approximated by networks with $\mathcal{O}(k)$ layers unless they are exponentially large --- they must possess $Ω(2^k)$ nodes. This result is proved here for a class of nodes termed "semi-algebraic gates" which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees (in this last case with a stronger separation: $Ω(2^{k^3})$ total tree nodes are required).

LGSep 27, 2015
Representation Benefits of Deep Feedforward Networks

Matus Telgarsky

This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.

LGJun 15, 2015
Convex Risk Minimization and Conditional Probability Estimation

Matus Telgarsky, Miroslav Dudík, Robert Schapire

This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, we show the same result holds for \emph{empirical} risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound.

LGOct 2, 2014
Scalable Nonlinear Learning with Adaptive Polynomial Expansions

Alekh Agarwal, Alina Beygelzimer, Daniel Hsu et al.

Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.

LGNov 8, 2013
Moment-based Uniform Deviation Bounds for $k$-means and Friends

Matus Telgarsky, Sanjoy Dasgupta

Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\min\{-1/4, -1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of $k$-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for $k$-means instances possessing some cluster structure.

LGMay 13, 2013
Boosting with the Logistic Loss is Consistent

Matus Telgarsky

This manuscript provides optimization guarantees, generalization bounds, and statistical consistency results for AdaBoost variants which replace the exponential loss with the logistic and similar losses (specifically, twice differentiable convex losses which are Lipschitz and tend to zero on one side). The heart of the analysis is to show that, in lieu of explicit regularization and constraints, the structure of the problem is fairly rigidly controlled by the source distribution itself. The first control of this type is in the separable case, where a distribution-dependent relaxed weak learning rate induces speedy convergence with high probability over any sample. Otherwise, in the nonseparable case, the convex surrogate risk itself exhibits distribution-dependent levels of curvature, and consequently the algorithm's output has small norm with high probability.

LGMar 18, 2013
Margins, Shrinkage, and Boosting

Matus Telgarsky

This manuscript shows that AdaBoost and its immediate variants can produce approximate maximum margin classifiers simply by scaling step size choices with a fixed small constant. In this way, when the unscaled step size is an optimal choice, these results provide guarantees for Friedman's empirically successful "shrinkage" procedure for gradient boosting (Friedman, 2000). Guarantees are also provided for a variety of other step sizes, affirming the intuition that increasingly regularized line searches provide improved margin guarantees. The results hold for the exponential loss and similar losses, most notably the logistic loss.

LGJan 21, 2013
Dirichlet draws are sparse with high probability

Matus Telgarsky

This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse (most coordinates are small).

LGOct 29, 2012
Tensor decompositions for learning latent variable models

Anima Anandkumar, Rong Ge, Daniel Hsu et al.

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.

LGJun 27, 2012
Agglomerative Bregman Clustering

Matus Telgarsky, Sanjoy Dasgupta

This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

LGJun 14, 2012
Statistical Consistency of Finite-dimensional Unregularized Linear Classification

Matus Telgarsky

This manuscript studies statistical properties of linear classifiers obtained through minimization of an unregularized convex risk over a finite sample. Although the results are explicitly finite-dimensional, inputs may be passed through feature maps; in this way, in addition to treating the consistency of logistic regression, this analysis also handles boosting over a finite weak learning class with, for instance, the exponential, logistic, and hinge losses. In this finite-dimensional setting, it is still possible to fit arbitrary decision boundaries: scaling the complexity of the weak learning class with the sample size leads to the optimal classification risk almost surely.