Rachel Ward

h-index: 30

55 papers

4,194 citations

Novelty: 53%

AI Score: 34

Ranked #112,256 of 194,257 authors (top 58%)
Ranked #1,644 in ML (top 49%)

55 Papers

6.6 · NA · Jul 15, 2011

Low-rank matrix recovery via iteratively reweighted least squares minimization

Massimo Fornasier, Holger Rauhut, Rachel Ward

We present and analyze an efficient implementation of an iteratively reweighted least squares algorithm for recovering a matrix from a small number of linear measurements. The algorithm is designed for the simultaneous promotion of both a minimal nuclear norm and an approximately low-rank solution. Under the assumption that the linear measurements fulfill a suitable generalization of the Null Space Property known in the context of compressed sensing, the algorithm is guaranteed to recover iteratively any matrix with an error of the order of the best rank-k approximation. In certain relevant cases, for instance for the matrix completion problem, our version of this algorithm can take advantage of the Woodbury matrix identity, which expedites the solution of the least squares problems required at each iteration. We present numerical experiments that confirm the robustness of the algorithm for the solution of matrix completion problems, and demonstrate its competitiveness with respect to other techniques proposed recently in the literature.

10.8 · IT · Feb 11, 2011

New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property

Felix Krahmer, Rachel Ward

Consider an m by N matrix Phi with the Restricted Isometry Property of order k and level delta, that is, the norm of any k-sparse vector in R^N is preserved to within a multiplicative factor of 1 +- delta under application of Phi. We show that by randomizing the column signs of such a matrix Phi, the resulting map with high probability embeds any fixed set of p = O(e^k) points in R^N into R^m without distorting the norm of any point in the set by more than a factor of 1 +- delta. Consequently, matrices with the Restricted Isometry Property and with randomized column signs provide optimal Johnson-Lindenstrauss embeddings up to logarithmic factors in N. In particular, our results improve the best known bounds on the necessary embedding dimension m for a wide class of structured random matrices; for partial Fourier and partial Hadamard matrices, we improve the recent bound m = O(delta^(-4) log(p) log^4(N)) appearing in Ailon and Liberty to m = O(delta^(-2) log(p) log^4(N)), which is optimal up to the logarithmic factors in N. Our results also have a direct application in the area of compressed sensing for redundant dictionaries.

6.9 · CV · Mar 12, 2013

Stable image reconstruction using total variation minimization

Deanna Needell, Rachel Ward

This article presents near-optimal guarantees for accurate and robust image recovery from under-sampled noisy measurements using total variation minimization. In particular, we show that from O(s log(N)) nonadaptive linear measurements, an image can be reconstructed to within the best s-term approximation of its gradient up to a logarithmic factor, and this factor can be removed by taking slightly more measurements. Along the way, we prove a strengthened Sobolev inequality for functions lying in the null space of suitably incoherent matrices.

8.0 · NA · Dec 9, 2008

Compressed Sensing with Cross Validation

Rachel Ward

Compressed Sensing decoding algorithms can efficiently recover an N-dimensional real-valued vector x to within a factor of its best k-term approximation by taking m = 2k log(N/k) measurements y = Phi x. If the sparsity or approximate sparsity level of x were known, then this theoretical guarantee would imply quality assurance of the resulting compressed sensing estimate. However, because the underlying sparsity of the signal x is unknown, the quality of a compressed sensing estimate x* using m measurements is not assured. Nevertheless, we demonstrate that sharp bounds on the error || x - x* ||_2 can be achieved with almost no effort. More precisely, we assume that a maximum number of measurements m is pre-imposed; we reserve 4 log(p) of the original m measurements and compute a sequence of possible estimates (x_j)_{j=1}^p to x from the m - 4 log(p) remaining measurements; the errors || x - x*_j ||_2 for j = 1, ..., p can then be bounded with high probability. As a consequence, numerical upper and lower bounds on the error between x and the best k-term approximation to x can be estimated for p values of k with almost no cost. Our observation has applications outside of compressed sensing as well.
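
The hold-out idea in this abstract can be illustrated with a minimal Python sketch. It assumes Gaussian measurement rows, and the function name `holdout_error_estimate`, the hold-out count `r = 200`, and the simulated signal are all hypothetical choices made for illustration; it does not reproduce the paper's constants or its probabilistic bounds. The root-mean-square residual on the reserved measurements serves as an estimate of || x - x* ||_2.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def holdout_error_estimate(rows, y_holdout, x_est):
    """Estimate ||x - x_est||_2 from reserved measurements y = <row, x>.

    For rows with i.i.d. standard Gaussian entries, the mean squared
    residual has expectation ||x - x_est||_2^2.
    """
    residuals = [y - dot(row, x_est) for row, y in zip(rows, y_holdout)]
    return math.sqrt(sum(res * res for res in residuals) / len(residuals))

# Simulated example (all values below are illustrative assumptions).
rng = random.Random(0)
n, r = 50, 200
x_true = [rng.gauss(0.0, 1.0) if i < 5 else 0.0 for i in range(n)]
x_est = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x_true]  # stand-in decoder output
rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(r)]
y_holdout = [dot(row, x_true) for row in rows]  # reserved measurements

estimate = holdout_error_estimate(rows, y_holdout, x_est)
true_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_true, x_est)))
```

Only the reserved measurements and the candidate estimate enter the estimator; the true signal is used here solely to simulate measurements and to compare against the actual error.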

28.5 · LG · May 19, 2022

How catastrophic can catastrophic forgetting be in linear regression?

Itay Evron, Edward Moroshko, Rachel Ward et al.

To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering.

5.9 · NA · Feb 20, 2011

Sparse recovery for spherical harmonic expansions

Holger Rauhut, Rachel Ward

We show that sparse spherical harmonic expansions can be efficiently recovered from a small number of randomly chosen samples on the sphere. To establish the main result, we verify the restricted isometry property of an associated preconditioned random measurement matrix using recent estimates on the uniform growth of Jacobi polynomials.

1.2 · NA · Apr 1, 2012

Two-subspace Projection Method for Coherent Overdetermined Systems (Technical Report)

Deanna Needell, Rachel Ward

In this technical report we present a Projection onto Convex Sets (POCS) type algorithm for solving systems of linear equations. POCS methods have found many applications ranging from computer tomography to digital signal and image processing. The Kaczmarz method is one of the most popular solvers for overdetermined systems of linear equations due to its speed and simplicity. Here we introduce and analyze an extension of the Kaczmarz method which iteratively projects the estimate onto a solution space given from two randomly selected rows. We show that this projection algorithm provides exponential convergence to the solution in expectation. The convergence rate significantly improves upon that of the standard randomized Kaczmarz method when the system has coherent rows. We also show that the method is robust to noise, and converges exponentially in expectation to the noise floor. Experimental results are provided which confirm that in the coherent case our method significantly outperforms the randomized Kaczmarz method.

6.6 · NA · Apr 28, 2009

Iterative thresholding meets free discontinuity problems

Massimo Fornasier, Rachel Ward

Free-discontinuity problems describe situations where the solution of interest is defined by a function and a lower dimensional set consisting of the discontinuities of the function. Hence, the derivative of the solution is assumed to be a 'small' function almost everywhere except on sets where it concentrates as a singular measure. This is the case, for instance, in crack detection from fracture mechanics or in certain digital image segmentation problems. If we discretize such situations for numerical purposes, the free-discontinuity problem in the discrete setting can be re-formulated as that of finding a derivative vector with small components at all but a few entries that exceed a certain threshold. This problem is similar to those encountered in the field of 'sparse recovery', where vectors with a small number of dominating components in absolute value are recovered from a few given linear measurements via the minimization of related energy functionals. Several iterative thresholding algorithms that intertwine gradient-type iterations with thresholding steps have been designed to recover sparse solutions in this setting. It is natural to wonder if and/or how such algorithms can be used towards solving discrete free-discontinuity problems. The current paper explores this connection, and, by establishing an iterative thresholding algorithm for discrete free-discontinuity problems, provides new insights on properties of minimizing solutions thereof.

1.2 · IT · May 10, 2018

Extracting structured dynamical systems using sparse optimization with very few samples

Hayden Schaeffer, Giang Tran, Rachel Ward et al.

Learning governing equations allows for deeper understanding of the structure and dynamics of data. We present a random sampling method for learning structured dynamical systems from under-sampled and possibly noisy state-space measurements. The learning problem takes the form of a sparse least-squares fitting over a large set of candidate functions. Based on a Bernstein-like inequality for partly dependent random variables, we provide theoretical guarantees on the recovery rate of the sparse coefficients and the identification of the candidate functions for the corresponding problem. Computational results are demonstrated on datasets generated by the Lorenz 96 equation, the viscous Burgers' equation, and the two-component reaction-diffusion equations (which are challenging due to parameter sensitivities in the model). This formulation has several advantages including ease of use, theoretical guarantees of success, and computational efficiency with respect to ambient dimension and number of candidate functions.

14.1 · LG · Jun 15, 2022

On the fast convergence of minibatch heavy ball momentum

Raghu Bollapragada, Tyler Chen, Rachel Ward

Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.

1.2 · NA · Oct 25, 2012

A symbol-based algorithm for decoding bar codes

Mark Iwen, Fadil Santosa, Rachel Ward

We investigate the problem of decoding a bar code from a signal measured with a hand-held laser-based scanner. Rather than formulating the inverse problem as one of binary image reconstruction, we instead incorporate the symbology of the bar code into the reconstruction algorithm directly, and search for a sparse representation of the UPC bar code with respect to this known dictionary. Our approach significantly reduces the degrees of freedom in the problem, allowing for accurate reconstruction that is robust to noise and unknown parameters in the scanning device. We propose a greedy reconstruction algorithm and provide robust reconstruction guarantees. Numerical examples illustrate the insensitivity of our symbology-based reconstruction to both imprecise model parameters and noise on the scanned measurements.

2.3 · NA · Apr 2, 2011

Sparse Legendre expansions via $\ell_1$ minimization

Holger Rauhut, Rachel Ward

We consider the problem of recovering polynomials that are sparse with respect to the basis of Legendre polynomials from a small number of random samples. In particular, we show that a Legendre s-sparse polynomial of maximal degree N can be recovered from m = O(s log^4(N)) random samples that are chosen independently according to the Chebyshev probability measure. As an efficient recovery method, l1-minimization can be used. We establish these results by verifying the restricted isometry property of a preconditioned random Legendre matrix. We then extend these results to a large class of orthogonal polynomial systems, including the Jacobi polynomials, of which the Legendre polynomials are a special case. Finally, we transpose these results into the setting of approximate recovery for functions in certain infinite-dimensional function spaces.

8.7 · LG · Mar 1, 2022

Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space

Juncai He, Richard Tsai, Rachel Ward

The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.

10.8 · ML · Jul 20, 2023

Cluster-aware Semi-supervised Learning: Relational Knowledge Distillation Provably Learns Clustering

Yijun Dong, Kevin Miller, Qi Lei et al.

Despite the empirical success and practical significance of (relational) knowledge distillation that matches (the relations of) features between teacher and student models, the corresponding theoretical interpretations remain limited for various knowledge distillation paradigms. In this work, we take an initial step toward a theoretical understanding of relational knowledge distillation (RKD), with a focus on semi-supervised classification problems. We start by casting RKD as spectral clustering on a population-induced graph unveiled by a teacher model. Via a notion of clustering error that quantifies the discrepancy between the predicted and ground truth clusterings, we illustrate that RKD over the population provably leads to low clustering error. Moreover, we provide a sample complexity bound for RKD with limited unlabeled samples. For semi-supervised learning, we further demonstrate the label efficiency of RKD through a general framework of cluster-aware semi-supervised learning that assumes low clustering errors. Finally, by unifying data augmentation consistency regularization into this cluster-aware framework, we show that despite the common effect of learning accurate clusterings, RKD facilitates a "global" perspective through spectral clustering, whereas consistency regularization focuses on a "local" perspective via expansion.

10.8 · ML · Apr 14, 2022

Concentration of Random Feature Matrices in High-Dimensions

Zhijun Chen, Hayden Schaeffer, Rachel Ward

The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models. Random feature matrices are asymmetric rectangular nonlinear matrices depending on two input variables, the data and the weights, which can make their characterization challenging. We consider two settings for the two input variables: either both are random variables, or one is a random variable and the other is well-separated, i.e. there is a minimum distance between points. With conditions on the dimension, the complexity ratio, and the sampling variance, we show that the singular values of these matrices concentrate near their full expectation and near one with high probability. In particular, since the dimension depends only on the logarithm of the number of random weights or the number of data points, our complexity bounds can be achieved even in moderate dimensions for many practical settings. The theoretical results are verified with numerical experiments.

5.3 · ML · May 16, 2022

An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models

Nhat Ho, Tongzheng Ren, Sujay Sanghavi et al.

Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed exponential step size gradient descent (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a polynomial number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
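
A toy Python sketch may help make the step-size idea concrete. It compares fixed-step gradient descent with a geometrically growing step on the flat, homogeneous loss f(x) = x^4. The loss, the initial step 0.01, and the growth factor 1.2 are assumptions chosen for illustration, and the helper name `gradient_descent` is hypothetical; the paper's actual schedule, constants, and statistical setting may differ.

```python
def gradient_descent(grad, x0, step0, growth=1.0, iters=100):
    """Run GD with step size step0 * growth**t; growth=1.0 is a fixed step."""
    x, step = x0, step0
    for _ in range(iters):
        x = x - step * grad(x)
        step *= growth  # exponentially increase the step when growth > 1
    return x

def grad_quartic(x):
    """Gradient of the toy loss f(x) = x**4, which is flat near its minimizer 0."""
    return 4.0 * x ** 3

x_fixed = gradient_descent(grad_quartic, x0=1.0, step0=0.01)
x_egd = gradient_descent(grad_quartic, x0=1.0, step0=0.01, growth=1.2)
```

On this toy problem the fixed step makes only polynomially slow progress because the gradient vanishes quickly near zero, while the growing step compensates for the shrinking gradient and ends much closer to the minimizer.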

2.6 · CV · Oct 4, 2022

Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift

Yijun Dong, Yuege Xie, Rachel Ward

Concept shift is a prevailing problem in natural tasks like medical image segmentation where samples usually come from different subpopulations with variant correlations between features and labels. One common type of concept shift in medical image segmentation is the "information imbalance" between label-sparse samples with few (if any) segmentation labels and label-dense samples with plentiful labeled pixels. Existing distributionally robust algorithms have focused on adaptively truncating/down-weighting the "less informative" (i.e., label-sparse in our context) samples. To exploit data features of label-sparse samples more efficiently, we propose an adaptively weighted online optimization algorithm -- AdaWAC -- to incorporate data augmentation consistency regularization in sample reweighting. Our method introduces a set of trainable weights to balance the supervised loss and unsupervised consistency regularization of each sample separately. At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization. We provide a convergence guarantee by recasting the optimization as online mirror descent on a saddle point problem. Our empirical results demonstrate that AdaWAC not only enhances the segmentation performance and sample efficiency but also improves the robustness to concept shift on various medical image segmentation tasks with different UNet-style backbones.

2.3 · NA · Jun 6, 2008

On Robustness Properties of Beta Encoders and Golden Ratio Encoders

Rachel Ward

The beta-encoder was recently proposed as a quantization scheme for analog-to-digital conversion; in contrast to classical binary quantization, in which each analog sample x in [-1,1] is mapped to the first N bits of its base-2 expansion, beta-encoders replace each sample x with its expansion in a base beta satisfying 1 < beta < 2. This expansion is non-unique when 1 < beta < 2, and the beta-encoder exploits this redundancy to correct inevitable errors made by the quantizer component of its circuit design. The multiplier element of the beta-encoder will also be imprecise; effectively, the true value beta at any time can only be specified to within an interval [ beta_{low}, beta_{high} ]. This problem was addressed by the golden ratio encoder, a close relative of the beta-encoder that does not require a precise multiplier. However, the golden ratio encoder is susceptible to integrator leak in the delay elements of its hardware design, and this has the same effect of changing beta to an unknown value. In this paper, we present a method whereby exponentially precise approximations to the value of beta in both golden ratio encoders and beta encoders can be recovered amidst imprecise circuit components from the truncated beta-expansions of a "test" number x_{test} in [-1,1], and its negative counterpart, -x_{test}. That is, beta-encoders and golden ratio encoders are robust with respect to unavoidable analog component imperfections that change the base beta needed for reconstruction.

2.3 · NA · Oct 8, 2012

Two-subspace Projection Method for Coherent Overdetermined Systems

Deanna Needell, Rachel Ward

We present a Projection onto Convex Sets (POCS) type algorithm for solving systems of linear equations. POCS methods have found many applications ranging from computer tomography to digital signal and image processing. The Kaczmarz method is one of the most popular solvers for overdetermined systems of linear equations due to its speed and simplicity. Here we introduce and analyze an extension of the Kaczmarz method that iteratively projects the estimate onto a solution space given by two randomly selected rows. We show that this projection algorithm provides exponential convergence to the solution in expectation. The convergence rate improves upon that of the standard randomized Kaczmarz method when the system has correlated rows. Experimental results confirm that in this case our method significantly outperforms the randomized Kaczmarz method.
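
The two-row projection step described in this abstract can be sketched in plain Python. This is a minimal illustration, assuming a consistent system with nonzero rows; the function name `two_row_kaczmarz`, the uniform row sampling, and the iteration count are hypothetical choices, not the authors' reference implementation. Each iteration projects the current estimate onto the set of points that satisfy two randomly chosen equations at once, using one Gram-Schmidt step.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_row_kaczmarz(A, b, iters=1000, seed=0):
    """Iteratively project onto the solution space of two random rows of Ax = b."""
    rng = random.Random(seed)
    # Normalize every equation so that each row has unit Euclidean norm.
    rows, rhs = [], []
    for a, bi in zip(A, b):
        norm = math.sqrt(dot(a, a))
        rows.append([c / norm for c in a])
        rhs.append(bi / norm)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        r, s = rng.sample(range(len(rows)), 2)  # two distinct row indices
        a_r, a_s = rows[r], rows[s]
        # Standard Kaczmarz step: project onto the hyperplane <a_r, x> = b_r.
        step = rhs[r] - dot(x, a_r)
        x = [xi + step * c for xi, c in zip(x, a_r)]
        mu = dot(a_r, a_s)
        gap = 1.0 - mu * mu
        if gap < 1e-12:
            continue  # rows are (numerically) parallel; nothing more to do
        scale = math.sqrt(gap)
        # v is the unit vector in span{a_r, a_s} orthogonal to a_r.
        v = [(cs - mu * cr) / scale for cs, cr in zip(a_s, a_r)]
        beta = (rhs[s] - mu * rhs[r]) / scale
        # Moving along v keeps equation r satisfied and enforces equation s.
        step = beta - dot(x, v)
        x = [xi + step * c for xi, c in zip(x, v)]
    return x
```

For example, calling `two_row_kaczmarz(A, b)` on a small consistent system whose first two rows are nearly parallel still converges, because the second projection removes the component shared by the two coherent rows.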

16.4 · LG · Jan 4, 2024 · Code

Generating synthetic data for neural operators

Erisa Hasani, Rachel A. Ward

Recent advances in the literature show promising potential of deep learning methods, particularly neural operators, in obtaining numerical solutions to partial differential equations (PDEs) beyond the reach of current numerical solvers. However, existing data-driven approaches often rely on training data produced by numerical PDE solvers (e.g., finite difference or finite element methods). We introduce a "backward" data generation method that avoids solving the PDE numerically: by randomly sampling candidate solutions $u_j$ from the appropriate solution space (e.g., $H_0^1(Ω)$), we compute the corresponding right-hand side $f_j$ directly from the equation by differentiation. This produces training pairs $\{(f_j, u_j)\}$ by computing derivatives rather than solving a PDE numerically for each data point, enabling fast, large-scale data generation consisting of exact solutions. Experiments indicate that models trained on this synthetic data generalize well when tested on data produced by standard solvers. While the idea is simple, we hope this method will expand the potential of neural PDE solvers that do not rely on classical numerical solvers to generate their data.
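
The "backward" idea can be sketched for a simple model problem. The example below assumes the 1D Poisson equation -u'' = f on (0, 1) with zero boundary values and draws u as a random sine series; the equation, the basis, the coefficient decay, and the function name `sample_pair` are illustrative assumptions, not the setup used in the paper. Because each sine mode is differentiated analytically, f is obtained without any numerical solve.

```python
import math
import random

def sample_pair(num_modes=5, seed=0):
    """Return callables (u, f) with -u'' = f on (0, 1) and u(0) = u(1) = 0."""
    rng = random.Random(seed)
    # Random coefficients with 1/k^2 decay (a hypothetical smoothness choice).
    coeffs = [rng.gauss(0.0, 1.0) / k ** 2 for k in range(1, num_modes + 1)]

    def u(x):
        return sum(c * math.sin(k * math.pi * x)
                   for k, c in enumerate(coeffs, start=1))

    def f(x):
        # -d^2/dx^2 sin(k*pi*x) = (k*pi)^2 sin(k*pi*x), applied term by term.
        return sum(c * (k * math.pi) ** 2 * math.sin(k * math.pi * x)
                   for k, c in enumerate(coeffs, start=1))

    return u, f

u, f = sample_pair(seed=0)
grid = [i / 10 for i in range(11)]
training_pair = ([f(x) for x in grid], [u(x) for x in grid])  # (f_j, u_j) on a grid
```

Changing `seed` yields a new exact pair at the cost of a few function evaluations, which is the point of generating data by differentiation instead of by solving.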

7.9 · LG · Oct 12, 2024

Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks

Zhenghao Xu, Yuqing Wang, Tuo Zhao et al.

We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that gradient descent (GD) can find a pair of $ε$-optimal solutions $\mathbf{X}_T\in\mathbb{R}^{m\times d}$ and $\mathbf{Y}_T\in\mathbb{R}^{n\times d}$, where $d\geq r$, satisfying $\lVert\mathbf{X}_T\mathbf{Y}_T^\top-\mathbf{A}\rVert_\mathrm{F}\leqε\lVert\mathbf{A}\rVert_\mathrm{F}$ in $T=O(κ^2\log\frac{1}ε)$ iterations with high probability, where $κ$ denotes the condition number of $\mathbf{A}$. Furthermore, we prove that Nesterov's accelerated gradient (NAG) attains an iteration complexity of $O(κ\log\frac{1}ε)$, which is the best-known bound of first-order methods for rectangular matrix factorization. Different from small balanced random initialization in the existing literature, we adopt an unbalanced initialization, where $\mathbf{X}_0$ is large and $\mathbf{Y}_0$ is $0$. Moreover, our initialization and analysis can be further extended to linear neural networks, where we prove that NAG can also attain an accelerated linear convergence rate. In particular, we only require the width of the network to be greater than or equal to the rank of the output label matrix. In contrast, previous results achieving the same rate require excessive widths that additionally depend on the condition number and the rank of the input data matrix.

14.9 · LG · May 11, 2023

Convergence of Alternating Gradient Descent for Matrix Factorization

Rachel Ward, Tamara G. Kolda

We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C (\frac{σ_1(\mathbf{A})}{σ_r(\mathbf{A})})^2 \log(1/ε)$ iterations of alternating gradient descent suffice to reach an $ε$-optimal factorization $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|^2 \leq ε\| \mathbf{A}\|^2$ with high probability starting from an atypical random initialization. The factors have rank $d \geq r$ so that $\mathbf{X}_{T}\in\mathbb{R}^{m \times d}$ and $\mathbf{Y}_{T} \in\mathbb{R}^{n \times d}$, and mild overparameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but rather significantly improves the convergence rate of gradient descent in practice. Our proof is conceptually simple: a uniform Polyak-Łojasiewicz (PL) inequality and uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.

11.5 · LG · May 9, 2023

Robust Implicit Regularization via Weight Normalization

Hung-Hsu Chou, Holger Rauhut, Rachel Ward

Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method towards a certain interpolating solution among the many. A by now established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low rank and/or sparse solutions when used to train deep linear networks, explaining to some extent why overparameterized neural network models trained by gradient descent tend to have good generalization performance in practice. However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scale at which weights are initialized in practice for faster convergence and better generalization performance. In this paper, we aim to close this gap by incorporating and analyzing gradient flow (continuous-time version of gradient descent) with weight normalization, where the weight vector is reparameterized in terms of polar coordinates, and gradient flow is applied to the polar coordinates. By analyzing key invariants of the gradient flow and using the Łojasiewicz theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at practically large scale. Experiments suggest that both the convergence speed and the robustness of the implicit bias improve dramatically when weight normalization is used in overparameterized diagonal linear network models.

15.6 · LG · Feb 24, 2022

Sample Efficiency of Data Augmentation Consistency Regularization

Shuo Yang, Yijun Dong, Rachel Ward et al.

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction: we first present a simple and novel analysis for linear regression with label invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then extended to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.

26.1 · ML · Feb 11, 2022

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Matthew Faw, Isidoros Tziotis, Constantine Caramanis et al.

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
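
The self-tuning step size can be sketched in a few lines of Python. The update below is the commonly stated AdaGrad-Norm rule, in which a single scalar accumulates squared gradient norms and the step size is eta divided by its square root. The function name `adagrad_norm`, the defaults `eta = 1.0` and `b0 = 0.1`, and the noise-free quadratic test problem are assumptions made so the example is reproducible; the abstract's setting is stochastic and non-convex.

```python
import math

def adagrad_norm(grad, x0, eta=1.0, b0=0.1, iters=500):
    """AdaGrad-Norm: one scalar step size eta / b_t shared by all coordinates."""
    x = list(x0)
    b_sq = b0 ** 2
    for _ in range(iters):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)   # b_t^2 = b_{t-1}^2 + ||g_t||^2
        step = eta / math.sqrt(b_sq)       # step size shrinks as gradients accumulate
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def grad_quadratic(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2 (noise-free)."""
    return list(x)

x_final = adagrad_norm(grad_quadratic, x0=[5.0, -3.0])
```

No learning-rate search is needed here: a very small initial `b0` is corrected after the first gradient, which mirrors the tuning-free behaviour the abstract emphasizes.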

9.9 · LG · Dec 7, 2021 · Code

SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

Yuege Xie, Bobby Shi, Hayden Schaeffer et al.

Sparse shrunk additive models and sparse random feature models have been developed separately as methods to learn low-order functions, where there are few interactions between variables, but neither offers computational efficiency. On the other hand, $\ell_2$-based shrunk additive models are efficient but do not offer feature selection as the resulting coefficient vectors are dense. Inspired by the success of the iterative magnitude pruning technique in finding lottery tickets of neural networks, we propose a new method -- Sparser Random Feature Models via IMP (SHRIMP) -- to efficiently fit high-dimensional data with inherent low-dimensional structure in the form of sparse variable dependencies. Our method can be viewed as a combined process to construct and find sparse lottery tickets for two-layer dense networks. We explain the observed benefit of SHRIMP through a refined analysis on the generalization error for thresholded Basis Pursuit and resulting bounds on eigenvalues. From function approximation experiments on both synthetic data and real-world benchmark datasets, we show that SHRIMP obtains better than or competitive test accuracy compared to state-of-the-art sparse feature and additive methods such as SRFE-S, SSAM, and SALSA. Meanwhile, SHRIMP performs feature selection with low computational complexity and is robust to the pruning rate, indicating a robustness in the structure of the obtained subnetworks. We gain insight into the lottery ticket hypothesis through SHRIMP by noting a correspondence between our model and weight/neuron subnetworks.

11.7DSSep 20, 2021Code

Learning to Forecast Dynamical Systems from Streaming Data

Dimitris Giannakis, Amelia Henriksen, Joel A. Tropp et al.

Kernel analog forecasting (KAF) is a powerful methodology for data-driven, non-parametric forecasting of dynamically generated time series data. This approach has a rigorous foundation in Koopman operator theory and it produces good forecasts in practice, but it suffers from the heavy computational costs common to kernel methods. This paper proposes a streaming algorithm for KAF that only requires a single pass over the training data. This algorithm dramatically reduces the costs of training and prediction without sacrificing forecasting skill. Computational experiments demonstrate that the streaming KAF method can successfully forecast several classes of dynamical systems (periodic, quasi-periodic, and chaotic) in both data-scarce and data-rich regimes. The overall methodology may have wider interest as a new template for streaming kernel regression.

6.3MLSep 17, 2021

AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

Xiaoxia Wu, Yuege Xie, Simon Du et al.

We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence in linear regression. Moreover, we provide a linear convergence guarantee over the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width of the first hidden layer in the two-layer networks is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.

6.6STJun 28, 2021

Bootstrapping the error of Oja's algorithm

Robert Lunde, Purnamrita Sarkar, Rachel Ward

We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution. By combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a weighted $χ^2$ approximation result for the $\sin^2$ error between the population eigenvector and the output of Oja's algorithm. Since estimating the covariance matrix associated with the approximating distribution requires knowledge of unknown model parameters, we propose a multiplier bootstrap algorithm that may be updated in an online manner. We establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability, thereby establishing the bootstrap as a consistent inferential method in an appropriate asymptotic regime.

17.4MLMar 4, 2021Code

Generalization Bounds for Sparse Random Feature Expansions

Abolfazl Hashemi, Hayden Schaeffer, Robert Shi et al.

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature expansion to obtain parsimonious random feature models. Specifically, we leverage ideas from compressive sensing to generate random feature expansions with theoretical guarantees even in the data-scarce setting. In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features. The generalization bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. In particular, by introducing sparse features, i.e. features with random sparse weights, we provide improved bounds for low order functions. We show that the sparse random feature expansions outperform shallow networks in several scientific machine learning tasks.

11.3DSFeb 6, 2021

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

De Huang, Jonathan Niles-Weed, Rachel Ward

We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.

7.2LGJun 15, 2020

Overparameterization and generalization error: weighted trigonometric interpolation

Yuege Xie, Hung-Hsu Chou, Holger Rauhut et al.

Motivated by surprisingly good generalization properties of learned deep neural networks in overparameterized scenarios and by the related double descent phenomenon, this paper analyzes the relation between smoothness and low generalization error in an overparameterized linear learning problem. We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples. We derive exact expressions for the generalization error of both plain and weighted least squares estimators. We show precisely how a bias towards smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error in the overparameterized regime compared to the underparameterized regime. This provides insight into the power of overparameterization, which is common in modern machine learning.

13.7LGNov 18, 2019

Implicit Regularization and Convergence for Weight Normalization

Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren et al.

Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights with a scale g and a unit vector w and thus the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum l2 norm solution, even for initializations far from zero. For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.

18.8MLAug 28, 2019

Linear Convergence of Adaptive Stochastic Gradient Descent

Yuege Xie, Xiaoxia Wu, Rachel Ward

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality. The paper introduces the notion of Restricted Uniform Inequality of Gradients (RUIG), a measure of the balanced-ness of the stochastic gradient norms, to depict the landscape of a function. RUIG plays a key role in proving the robustness of AdaGrad-Norm to its hyper-parameter tuning in the stochastic setting. On top of RUIG, we develop a two-stage framework to prove the linear convergence of AdaGrad-Norm without knowing the parameters of the objective functions. This framework can likely be extended to other adaptive stepsize algorithms. The numerical experiments validate the theory and suggest future directions for improvement.

4.9MLJul 26, 2019

Bias of Homotopic Gradient Descent for the Hinge Loss

Denali Molitor, Deanna Needell, Rachel Ward

Gradient descent is a simple and widely used optimization method for machine learning. For homogeneous linear classifiers applied to separable data, gradient descent has been shown to converge to the maximal margin (or equivalently, the minimal norm) solution for various smooth loss functions. The previous theory does not, however, apply to non-smooth functions such as the hinge loss which is widely used in practice. Here, we study the convergence of a homotopic variant of gradient descent applied to the hinge loss and provide explicit convergence rates to the max-margin solution for linearly separable data.

10.4MLMay 28, 2019

AdaOja: Adaptive Learning Rates for Streaming PCA

Amelia Henriksen, Rachel Ward

Oja's algorithm has been the cornerstone of streaming methods in Principal Component Analysis (PCA) since it was first proposed in 1982. However, Oja's algorithm does not have a standardized choice of learning rate (step size) that both performs well in practice and truly conforms to the online streaming setting. In this paper, we propose a new learning rate scheme for Oja's method called AdaOja. This new algorithm requires only a single pass over the data and does not depend on knowing properties of the data set a priori. AdaOja is a novel adaptation of the AdaGrad algorithm to Oja's algorithm in the single eigenvector case, which we then extend to the multiple eigenvector case. We demonstrate for dense synthetic data, sparse real-world data and dense real-world data that AdaOja outperforms common learning rate choices for Oja's method. We also show that AdaOja performs comparably to state-of-the-art algorithms (History PCA and Streaming Power Method) in the same streaming PCA setting.
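
As a point of reference for the abstract above, here is a minimal sketch of the classical Oja iteration for the leading eigenvector. It is not the AdaOja scheme: the step size `c / t`, the function name `oja_leading_eigenvector`, and the synthetic Gaussian stream are all assumptions chosen for illustration. A hand-picked schedule of this kind is exactly what an adaptive rule aims to replace.

```python
import math
import random

def oja_leading_eigenvector(samples, w0, c=1.0):
    """Classical Oja iteration for the top eigenvector, one pass over the samples."""
    w = list(w0)
    for t, x in enumerate(samples, start=1):
        eta = c / t                                  # assumed schedule, not AdaOja's
        proj = sum(xi * wi for xi, wi in zip(x, w))  # inner product <x, w>
        w = [wi + eta * proj * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]                  # renormalize to unit length
    return w

# Hypothetical stream: independent Gaussian coordinates with standard
# deviations 3.0, 0.3, 0.3, so the leading eigenvector is the first axis.
rng = random.Random(0)
scales = [3.0, 0.3, 0.3]
stream = ([s * rng.gauss(0.0, 1.0) for s in scales] for _ in range(5000))
w = oja_leading_eigenvector(stream, [1 / math.sqrt(3)] * 3)
```

The stream is a generator, so each sample is seen once and then discarded, which matches the single-pass setting described above.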

22.7LGFeb 19, 2019

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

Xiaoxia Wu, Simon S. Du, Rachel Ward

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee convergence to a stationary point. We propose an adaptive gradient method and show that for two-layer over-parameterized neural networks -- if the width is sufficiently large (polynomially) -- then the proposed method converges to the global minimum in polynomial time, and convergence is robust, without the need to fine-tune hyper-parameters such as the step-size schedule and with the level of over-parametrization independent of the training error. Our analysis indicates in particular that over-parametrization is crucial for harnessing the full potential of adaptive gradient methods in the setting of neural networks.

3.3ITNov 25, 2018

Recovery guarantees for polynomial approximation from dependent data with outliers

Lam Si Tung Ho, Hayden Schaeffer, Giang Tran et al.

Learning non-linear systems from noisy, limited, and/or dependent data is an important task across various scientific fields including statistics, engineering, computer science, mathematics, and many more. In general, this learning task is ill-posed; however, additional information about the data's structure or on the behavior of the unknown function can make the task well-posed. In this work, we study the problem of learning nonlinear functions from corrupted and dependent data. The learning problem is recast as a sparse robust linear regression problem where we incorporate both the unknown coefficients and the corruptions in a basis pursuit framework. The main contribution of our paper is to provide a reconstruction guarantee for the associated $\ell_1$-optimization problem where the sampling matrix is formed from dependent data. Specifically, we prove that the sampling matrix satisfies the null space property and the stable null space property, provided that the data is compact and satisfies a suitable concentration inequality. We show that our recovery results are applicable to various types of dependent data such as exponentially strongly $α$-mixing data, geometrically $\mathcal{C}$-mixing data, and uniformly ergodic Markov chains. Our theoretical results are verified via several numerical simulations.

35.3MLJun 5, 2018

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu, Léon Bottou

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyper-parameters of the algorithm, in contrast to stochastic gradient descent whose convergence depends crucially on tuning the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theory; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.
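
The AdaGrad-Norm update discussed in this abstract keeps a single scalar accumulator of squared gradient norms and divides a base step by its square root. The sketch below is a minimal illustration under stated assumptions: the function name `adagrad_norm`, the defaults `eta=1.0` and `b0=0.1`, and the deterministic quadratic test objective are hypothetical choices and are not taken from the paper's experiments.

```python
import math

def adagrad_norm(grad, x0, eta=1.0, b0=0.1, steps=500):
    """AdaGrad-Norm sketch: one scalar step size shared by all coordinates."""
    x = list(x0)
    b_sq = b0 * b0
    for _ in range(steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)   # b_{j+1}^2 = b_j^2 + ||g_j||^2
        step = eta / math.sqrt(b_sq)       # effective step size eta / b_{j+1}
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Assumed toy objective f(x) = 0.5 * (x_1^2 + 10 * x_2^2), exact gradients.
def quad_grad(x):
    return [x[0], 10.0 * x[1]]

x_final = adagrad_norm(quad_grad, [3.0, -2.0])
```

No smoothness constant is supplied to the routine. The first large gradient inflates the accumulator, which shrinks the step to a stable size on its own, and that is the robustness to tuning the abstract refers to.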

23.2MLMar 7, 2018

WNGrad: Learn the Learning Rate in Gradient Descent

Xiaoxia Wu, Rachel Ward, Léon Bottou

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice. If certain parameters of the loss function such as smoothness or strong convexity constants are known, theoretical learning rate schedules can be applied. However, in practice, such parameters are not known, and the loss function of interest is not convex in any case. The recently proposed batch normalization reparametrization is widely adopted in most neural network architectures today because, among other advantages, it is robust to the choice of Lipschitz constant of the gradient of the loss function, allowing one to set a large learning rate without worry. Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent so that the learning rate can be initialized at a high value, and is subsequently decreased according to gradient observations along the way. The proposed method is shown to achieve robustness to the relationship between the learning rate and the Lipschitz constant, and near-optimal convergence rates in both the batch and stochastic settings ($O(1/T)$ for smooth loss in the batch setting, and $O(1/\sqrt{T})$ for convex loss in the stochastic setting). We also show through numerical evidence that such robustness of the proposed method extends to highly nonconvex and possibly non-smooth loss functions in deep learning problems. Our analysis establishes some first theoretical understanding of the observed robustness for batch normalization and weight normalization.

2.3NASep 5, 2017

Learning Dynamical Systems and Bifurcation via Group Sparsity

Hayden Schaeffer, Giang Tran, Rachel Ward

Learning governing equations from a family of data sets which share the same physical laws but differ in bifurcation parameters is challenging. This is due, in part, to the wide range of phenomena that could be represented in the data sets as well as the range of parameter values. On the other hand, it is common to assume only a small number of candidate functions contribute to the observed dynamics. Based on these observations, we propose a group-sparse penalized method for model selection and parameter estimation for such data. We also provide convergence guarantees for our proposed numerical scheme. Various numerical experiments including the 1D logistic equation, the 3D Lorenz sampled from different bifurcation regions, and a switching system provide numerical validation for our method and suggest potential applications to applied dynamical systems.

7.3GTOct 17, 2016

A polynomial-time relaxation of the Gromov-Hausdorff distance

Soledad Villar, Afonso S. Bandeira, Andrew J. Blumberg et al.

The Gromov-Hausdorff distance provides a metric on the set of isometry classes of compact metric spaces. Unfortunately, computing this metric directly is believed to be computationally intractable. Motivated by applications in shape matching and point-cloud comparison, we study a semidefinite programming relaxation of the Gromov-Hausdorff metric. This relaxation can be computed in polynomial time, and somewhat surprisingly is itself a pseudometric. We describe the induced topology on the set of compact metric spaces. Finally, we demonstrate the numerical performance of various algorithms for computing the relaxed distance and apply these algorithms to several relevant data sets. In particular we propose a greedy algorithm for finding the best correspondence between finite metric spaces that can handle hundreds of points.

19.1MLFeb 22, 2016

Clustering subgaussian mixtures by semidefinite programming

Dustin G. Mixon, Soledad Villar, Rachel Ward

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei. The algorithm interprets the SDP output as a denoised version of the original data and then rounds this output to a hard clustering. We provide a generic method for proving performance guarantees for this algorithm, and we analyze the algorithm in the context of subgaussian mixture models. We also study the fundamental limits of estimating Gaussian centers by k-means clustering in order to compare our approximation guarantee to the theoretically optimal k-means clustering solution.

21.0NAJun 25, 2015

The local convexity of solving systems of quadratic equations

Chris D. White, Sujay Sanghavi, Rachel Ward

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$). Such problems arise in a variety of applications, including covariance sketching of high-dimensional data streams, quadratic regression, and quantum state tomography, among others. A natural approach to this problem is to minimize the loss function $f(U) = \sum_i (y_i - a_i^TUU^Ta_i)^2$ which has an entire manifold of solutions given by $\{XO\}_{O\in\mathcal{O}_r}$ where $\mathcal{O}_r$ is the orthogonal group of $r\times r$ orthogonal matrices; this is non-convex in the $n\times r$ matrix $U$, but methods like gradient descent are simple and easy to implement (as compared to semidefinite relaxation approaches). In this paper we show that once we have $m \geq C nr \log^2(n)$ samples from isotropic Gaussian $a_i$, with high probability (a) this function admits a dimension-independent region of local strong convexity on lines perpendicular to the solution manifold, and (b) with an additional polynomial factor of $r$ samples, a simple spectral initialization will land within the region of convexity with high probability. Together, this implies that gradient descent with initialization (but no re-sampling) will converge linearly to the correct $X$, up to an orthogonal transformation. We believe that this general technique (local convexity reachable by spectral initialization) should prove applicable to a broader class of nonconvex optimization problems.

1.2ITJun 8, 2015

Compressive Sensing with Redundant Dictionaries and Structured Measurements

Felix Krahmer, Deanna Needell, Rachel Ward

Consider the problem of recovering an unknown signal from undersampled measurements, given the knowledge that the signal has a sparse representation in a specified dictionary $D$. This problem is now understood to be well-posed and efficiently solvable under suitable assumptions on the measurements and dictionary, if the number of measurements scales roughly with the sparsity level. One sufficient condition for such is the $D$-restricted isometry property ($D$-RIP), which asks that the sampling matrix approximately preserve the norm of all signals which are sufficiently sparse in $D$. While many classes of random matrices are known to satisfy such conditions, such matrices are not representative of the structural constraints imposed by practical sensing systems. We close this gap in the theory by demonstrating that one can subsample a fixed orthogonal matrix in such a way that the $D$-RIP will hold, provided this basis is sufficiently incoherent with the sparsifying dictionary $D$. We also extend this analysis to allow for weighted sparse expansions. Consequently, we arrive at compressive sensing recovery guarantees for structured measurements and redundant dictionaries, opening the door to a wide array of practical applications.

1.2DSJun 1, 2015

A unified framework for linear dimensionality reduction in L1

Felix Krahmer, Rachel Ward

For a family of interpolation norms $\| \cdot \|_{1,2,s}$ on $\mathbb{R}^n$, we provide a distribution over random matrices $Φ_s \in \mathbb{R}^{m \times n}$ parametrized by sparsity level $s$ such that for a fixed set $X$ of $K$ points in $\mathbb{R}^n$, if $m \geq C s \log(K)$ then with high probability, $\frac{1}{2} \| x \|_{1,2,s} \leq \| Φ_s (x) \|_1 \leq 2 \| x\|_{1,2,s}$ for all $x\in X$. Several existing results in the literature reduce to special cases of this result at different values of $s$: for $s=n$, $\| x\|_{1,2,n} \equiv \| x \|_{1}$ and we recover that dimension reducing linear maps can preserve the $\ell_1$-norm up to a distortion proportional to the dimension reduction factor, which is known to be the best possible such result. For $s=1$, $\|x \|_{1,2,1} \equiv \| x \|_{2}$, and we recover an $\ell_2 / \ell_1$ variant of the Johnson-Lindenstrauss Lemma for Gaussian random matrices. Finally, if $x$ is $s$-sparse, then $\| x \|_{1,2,s} = \| x \|_1$ and we recover that $s$-sparse vectors in $\ell_1^n$ embed into $\ell_1^{\mathcal{O}(s \log(n))}$ via sparse random matrix constructions.

1.2FAMar 26, 2015

Interpolation via weighted $l_1$ minimization

Holger Rauhut, Rachel Ward

Functions of interest are often smooth and sparse in some sense, and both priors should be taken into account when interpolating sampled data. Classical linear interpolation methods are effective under strong regularity assumptions, but cannot incorporate nonlinear sparsity structure. At the same time, nonlinear methods such as $l_1$ minimization can reconstruct sparse functions from very few samples, but do not necessarily encourage smoothness. Here we show that weighted $l_1$ minimization effectively merges the two approaches, promoting both sparsity and smoothness in reconstruction. More precisely, we provide specific choices of weights in the $l_1$ objective to achieve rates for functions with coefficient sequences in weighted $l_p$ spaces, $p \leq 1$. We consider the implications of these results for spherical harmonic and polynomial interpolation, in the univariate and multivariate setting. Along the way, we extend concepts from compressive sensing such as the restricted isometry property and null space property to accommodate weighted sparse expansions; these developments should be of independent interest in the study of structured sparse approximations and continuous-time compressive sensing problems.

21.3MLAug 18, 2014

Relax, no need to round: integrality of clustering formulations

Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar et al.

We study exact recovery conditions for convex relaxations of point cloud clustering problems, focusing on two of the most common optimization problems for unsupervised clustering: $k$-means and $k$-median clustering. Motivations for focusing on convex relaxations are: (a) they come with a certificate of optimality, and (b) they are generic tools which are relatively parameter-free, not tailored to specific assumptions over the input. More precisely, we consider the distributional setting where there are $k$ clusters in $\mathbb{R}^m$ and data from each cluster consists of $n$ points sampled from a symmetric distribution within a ball of unit radius. We ask: what is the minimal separation distance between cluster centers needed for convex relaxations to exactly recover these $k$ clusters as the optimal integral solution? For the $k$-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small pairwise separation $ε > 0$ between the balls. In other words, the pairwise center separation is $Δ > 2+ε$. Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $Δ = 4$. Yet, if we enforce PSD constraints on the $k$-means LP, we get exact cluster recovery at center separation $Δ > 2\sqrt{2}(1+\sqrt{1/m})$. In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the $k$-means algorithm) can fail to recover clusters in this setting; even with arbitrarily large cluster separation, $k$-means++ with overseeding by any constant factor fails with high probability at exact cluster recovery. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods, and discuss several open problems which these experiments suggest.

16.7MLApr 28, 2014

One-bit compressive sensing with norm estimation

Karin Knudson, Rayan Saab, Rachel Ward

Consider the recovery of an unknown signal ${x}$ from quantized linear measurements. In the one-bit compressive sensing setting, one typically assumes that ${x}$ is sparse, and that the measurements are of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle) \in \{\pm1\}$. Since such measurements give no information on the norm of ${x}$, recovery methods from such measurements typically assume that $\| {x} \|_2=1$. We show that if one allows more generally for quantized affine measurements of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle + b_i)$, and if the vectors ${a}_i$ are random, an appropriate choice of the affine shifts $b_i$ allows norm recovery to be easily incorporated into existing methods for one-bit compressive sensing. Additionally, we show that for arbitrary fixed ${x}$ in the annulus $r \leq \| {x} \|_2 \leq R$, one may estimate the norm $\| {x} \|_2$ up to additive error $δ$ from $m \gtrsim R^4 r^{-2} δ^{-2}$ such binary measurements through a single evaluation of the inverse Gaussian error function. Finally, all of our recovery guarantees can be made universal over sparse vectors, in the sense that with high probability, one set of measurements and thresholds can successfully estimate all sparse vectors ${x}$ within a Euclidean ball of known radius.
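
To illustrate why affine shifts make the norm recoverable, the following sketch works under simplifying assumptions that are not taken from the paper: standard Gaussian measurement vectors and a single constant shift `tau` for every measurement. In that case $\langle a, x \rangle \sim N(0, \|x\|_2^2)$, so the fraction of positive signs estimates $\Phi(\tau / \|x\|_2)$ and one inverse Gaussian CDF evaluation recovers the norm. The function name `estimate_norm` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def estimate_norm(x, tau, m, rng):
    """Estimate ||x||_2 from m one-bit measurements sign(<a_i, x> + tau)."""
    positives = 0
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in x]     # standard Gaussian measurement vector
        if sum(ai * xi for ai, xi in zip(a, x)) + tau > 0:
            positives += 1
    p_hat = positives / m
    # P(<a, x> + tau > 0) = Phi(tau / ||x||), so invert the Gaussian CDF once.
    return tau / NormalDist().inv_cdf(p_hat)

# Hypothetical signal with Euclidean norm 2.
norm_hat = estimate_norm([1.2, -1.6, 0.0], tau=1.0, m=20000, rng=random.Random(0))
```

The estimate breaks down when the empirical fraction equals 0, 1/2, or 1, which is one way to see why the abstract restricts $x$ to an annulus of known inner and outer radii.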

36.2NAOct 21, 2013

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

Deanna Needell, Nathan Srebro, Rachel Ward

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/μ)^2$ (where $L$ is a bound on the smoothness and $μ$ on the strong convexity) to a linear dependence on $L/μ$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
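
For readers unfamiliar with the method, here is a minimal sketch of the standard randomized Kaczmarz iteration on a consistent system, with rows sampled in proportion to their squared norms. Each step projects the iterate onto the hyperplane of one equation, which is also a stochastic gradient step on a reweighted least squares objective. The function name `randomized_kaczmarz` and the 3-by-2 example system are assumptions for illustration; the partially biased sampling variant from the abstract is not shown.

```python
import random

def randomized_kaczmarz(A, b, iters=500, seed=0):
    """Randomized Kaczmarz: project onto one randomly chosen equation per step."""
    rng = random.Random(seed)
    row_norms_sq = [sum(v * v for v in row) for row in A]
    x = [0.0] * len(A[0])
    for _ in range(iters):
        # Sample row i with probability proportional to ||a_i||^2.
        i = rng.choices(range(len(A)), weights=row_norms_sq)[0]
        residual = b[i] - sum(aij * xj for aij, xj in zip(A[i], x))
        scale = residual / row_norms_sq[i]
        x = [xj + scale * aij for xj, aij in zip(x, A[i])]
    return x

# Hypothetical consistent system whose exact solution is x = [1, 2].
A = [[1.0, 2.0], [3.0, 1.0], [1.0, -1.0]]
b = [5.0, 5.0, -1.0]
x_hat = randomized_kaczmarz(A, b)
```

Because the system is consistent, the iterates converge to the exact solution. For an inconsistent system this sampling rule converges only to a neighborhood of a weighted least squares solution, which is the issue the abstract's modified algorithm addresses.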