DSJul 17, 2022
Online Lewis Weight SamplingDavid P. Woodruff, Taisuke Yasuda
The seminal work of Cohen and Peng introduced Lewis weight sampling to the theoretical computer science community, yielding fast row sampling algorithms for approximating $d$-dimensional subspaces of $\ell_p$ up to $(1+ε)$ error. Several works have extended this important primitive to other settings, including the online coreset and sliding window models. However, these results are only for $p\in\{1,2\}$, and results for $p=1$ require a suboptimal $\tilde O(d^2/ε^2)$ samples. In this work, we design the first nearly optimal $\ell_p$ subspace embeddings for all $p\in(0,\infty)$ in the online coreset and sliding window models. In both models, our algorithms store $\tilde O(d^{1\lor(p/2)}/ε^2)$ rows. This answers a substantial generalization of the main open question of [BDMMUWZ2020], and gives the first results for all $p\notin\{1,2\}$. Towards our result, we give the first analysis of "one-shot'' Lewis weight sampling of sampling rows proportionally to their Lewis weights, with sample complexity $\tilde O(d^{p/2}/ε^2)$ for $p>2$. Previously, this scheme was only known to have sample complexity $\tilde O(d^{p/2}/ε^5)$, whereas $\tilde O(d^{p/2}/ε^2)$ is known if a more sophisticated recursive sampling is used. The recursive sampling cannot be implemented online, thus necessitating an analysis of one-shot Lewis weight sampling. Our analysis uses a novel connection to online numerical linear algebra. As an application, we obtain the first one-pass streaming coreset algorithms for $(1+ε)$ approximation of important generalized linear models, such as logistic regression and $p$-probit regression. Our upper bounds are parameterized by a complexity parameter $μ$ introduced by [MSSW2018], and we show the first lower bounds showing that a linear dependence on $μ$ is necessary.
LGSep 29, 2022
Sequential Attention for Feature SelectionTaisuke Yasuda, MohammadHossein Bateni, Lin Chen et al.
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
DSJun 1, 2023
Sharper Bounds for $\ell_p$ Sensitivity SamplingDavid P. Woodruff, Taisuke Yasuda
In large scale machine learning, random sampling is a popular way to approximate datasets by a small representative subset of examples. In particular, sensitivity sampling is an intensely studied technique which provides provable guarantees on the quality of approximation, while reducing the number of examples to the product of the VC dimension $d$ and the total sensitivity $\mathfrak S$ in remarkably general settings. However, guarantees going beyond this general bound of $\mathfrak S d$ are known in perhaps only one setting, for $\ell_2$ subspace embeddings, despite intense study of sensitivity sampling in prior work. In this work, we show the first bounds for sensitivity sampling for $\ell_p$ subspace embeddings for $p > 2$ that improve over the general $\mathfrak S d$ bound, achieving a bound of roughly $\mathfrak S^{2-2/p}$ for $2<p<\infty$. Furthermore, our techniques yield further new results in the study of sampling algorithms, showing that the root leverage score sampling algorithm achieves a bound of roughly $d$ for $1\leq p<2$, and that a combination of leverage score and sensitivity sampling achieves an improved bound of roughly $d^{2/p}\mathfrak S^{2-4/p}$ for $2<p<\infty$. Our sensitivity sampling results yield the best known sample complexity for a wide class of structured matrices that have small $\ell_p$ sensitivity.
DSOct 29, 2023
Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile StreamingGregory Dexter, Petros Drineas, David P. Woodruff et al.
Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a turnstile stream, we show an $\tilde O(nr/ε^2 + dk/ε)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/ε^2 + dk/ε)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tildeΩ(n/ε+ dk/ε)$ lower bound for $k$-means clustering, as well as an $\tildeΩ(n/ε^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers.
DSJul 3, 2024
Ridge Leverage Score Sampling for $\ell_p$ Subspace ApproximationDavid P. Woodruff, Taisuke Yasuda
The $\ell_p$ subspace approximation problem is an NP-hard low rank approximation problem that generalizes the median hyperplane ($p = 1$), principal component analysis ($p = 2$), and center hyperplane problems ($p = \infty$). A popular approach to cope with the NP-hardness is to compute a strong coreset, which is a weighted subset of input points that simultaneously approximates the cost of every $k$-dimensional subspace, typically to $(1+ε)$ relative error for a small constant $ε$. We obtain an algorithm for constructing a strong coreset for $\ell_p$ subspace approximation of size $\tilde O(kε^{-4/p})$ for $p<2$ and $\tilde O(k^{p/2}ε^{-p})$ for $p>2$. This offers the following improvements over prior work: - We construct the first strong coresets with nearly optimal dependence on $k$ for all $p\neq 2$. In prior work, [SW18] constructed coresets of modified points with a similar dependence on $k$, while [HV20] constructed true coresets with polynomially worse dependence on $k$. - We recover or improve the best known $ε$ dependence for all $p$. In particular, for $p > 2$, the [SW18] coreset of modified points had a dependence of $ε^{-p^2/2}$ and the [HV20] coreset had a dependence of $ε^{-3p}$. Our algorithm is based on sampling by root ridge leverage scores, which admits fast algorithms, especially for sparse or structured matrices. Our analysis avoids the use of the representative subspace theorem [SW18], which is a critical component of all prior dimension-independent coresets for $\ell_p$ subspace approximation. Our techniques also lead to the first nearly optimal online strong coresets for $\ell_p$ subspace approximation with similar bounds as the offline setting, resolving a problem of [WY23]. All prior approaches lose $\mathrm{poly}(k)$ factors in this setting, even when allowed to modify the original points.
LGJul 14, 2023
Performance of $\ell_1$ Regularization for Sparse Convex OptimizationKyriakos Axiotis, Taisuke Yasuda
Despite widespread adoption in practice, guarantees for the LASSO and Group LASSO are strikingly lacking in settings beyond statistical problems, and these algorithms are usually considered to be a heuristic in the context of sparse convex optimization on deterministic inputs. We give the first recovery guarantees for the Group LASSO for sparse convex optimization with vector-valued features. We show that if a sufficiently large Group LASSO regularization is applied when minimizing a strictly convex function $l$, then the minimizer is a sparse vector supported on vector-valued features with the largest $\ell_2$ norm of the gradient. Thus, repeating this procedure selects the same set of features as the Orthogonal Matching Pursuit algorithm, which admits recovery guarantees for any function $l$ with restricted strong convexity and smoothness via weak submodularity arguments. This answers open questions of Tibshirani et al. and Yasuda et al. Our result is the first to theoretically explain the empirical success of the Group LASSO for convex functions under general input instances assuming only restricted strong convexity and smoothness. Our result also generalizes provable guarantees for the Sequential Attention algorithm, which is a feature selection algorithm inspired by the attention mechanism proposed by Yasuda et al. As an application of our result, we give new results for the column subset selection problem, which is well-studied when the loss is the Frobenius norm or other entrywise matrix losses. We give the first result for general loss functions for this problem that requires only restricted strong convexity and smoothness.
LGFeb 27, 2024
SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial OptimizationTaisuke Yasuda, Kyriakos Axiotis, Gang Fu et al.
Neural network pruning is a key technique towards engineering large yet scalable, interpretable, and generalizable models. Prior work on the subject has developed largely along two orthogonal directions: (1) differentiable pruning for efficiently and accurately scoring the importance of parameters, and (2) combinatorial optimization for efficiently searching over the space of sparse models. We unite the two approaches, both theoretically and empirically, to produce a coherent framework for structured neural network pruning in which differentiable pruning guides combinatorial optimization algorithms to select the most important sparse set of parameters. Theoretically, we show how many existing differentiable pruning techniques can be understood as nonconvex regularization for group sparse optimization, and prove that for a wide class of nonconvex regularizers, the global optimum is unique, group-sparse, and provably yields an approximate solution to a sparse convex optimization problem. The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets.
DSJan 3, 2025
John Ellipsoids via Lazy UpdatesDavid P. Woodruff, Taisuke Yasuda
We give a faster algorithm for computing an approximate John ellipsoid around $n$ points in $d$ dimensions. The best known prior algorithms are based on repeatedly computing the leverage scores of the points and reweighting them by these scores [CCLY19]. We show that this algorithm can be substantially sped up by delaying the computation of high accuracy leverage scores by using sampling, and then later computing multiple batches of high accuracy leverage scores via fast rectangular matrix multiplication. We also give low-space streaming algorithms for John ellipsoids using similar ideas.
DSJun 4, 2024
Coresets for Multiple $\ell_p$ RegressionDavid P. Woodruff, Taisuke Yasuda
A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$ independently of $m$ (i.e., dimension-free) that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two application of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tildeΘ(\varepsilon^{-2})$ samples for $p = 1$, $\tildeΘ(\varepsilon^{-1})$ samples for $1 < p < 2$, and $\tildeΘ(\varepsilon^{1-p})$ samples for $p>2$ is tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.
DSJun 4, 2024
Reweighted Solutions for Weighted Low Rank ApproximationDavid P. Woodruff, Taisuke Yasuda
Weighted low rank approximation (WLRA) is an important yet computationally challenging primitive with applications ranging from statistical analysis, model compression, and signal processing. To cope with the NP-hardness of this problem, prior work considers heuristics, bicriteria, or fixed parameter tractable algorithms to solve this problem. In this work, we introduce a new relaxed solution to WLRA which outputs a matrix that is not necessarily low rank, but can be stored using very few parameters and gives provable approximation guarantees when the weight matrix has low rank. Our central idea is to use the weight matrix itself to reweight a low rank solution, which gives an extremely simple algorithm with remarkable empirical performance in applications to model compression and on synthetic datasets. Our algorithm also gives nearly optimal communication complexity bounds for a natural distributed problem associated with this problem, for which we show matching communication lower bounds. Together, our communication complexity bounds show that the rank of the weight matrix provably parameterizes the communication complexity of WLRA. We also obtain the first relative error guarantees for feature selection with a weighted objective.
LGNov 9, 2021
Active Linear Regression for $\ell_p$ Norms and BeyondCameron Musco, Christopher Musco, David P. Woodruff et al.
We study active sampling algorithms for linear regression, which aim to query only a few entries of a target vector $b\in\mathbb R^n$ and output a near minimizer to $\min_{x\in\mathbb R^d} \|Ax-b\|$, for a design matrix $A\in\mathbb R^{n \times d}$ and loss $\|\cdot\|$. For $p$ norm regression for any $0<p<\infty$, we give an algorithm based on Lewis weight sampling outputting a $(1+ε)$-approximate solution using just $\tilde O(d/ε^2)$ queries to $b$ for $p\in(0,1)$, $\tilde{O}(d/ε)$ queries for $1<p<2$, and $\tilde{O}(d^{p/2}/ε^p)$ queries for $2<p<\infty$. For $0<p<2$, our bounds are optimal up to log factors, settling the query complexity for this range. For $2<p<\infty$, our dependence on $d$ is optimal, while our dependence on $ε$ is off by at most $ε$, up to log factors. Our result resolves an open question of [CD21], who gave near optimal bounds for the $1$ norm, but required $d^2/ε^2$ samples for $\ell_p$ regression with $1<p<2$, and gave no bounds for $2<p<\infty$ or $0<p<1$. We also give the first total sensitivity bound of $O(d^{\max\{1,p/2\}}\log^2n)$ for loss functions of degree $p$ polynomial growth, improving a result of [TMF20]. By combining this with our techniques for $\ell_p$ regression, we obtain an active regression algorithm making $\tilde O(d^{1+\max\{1,p/2\}}/\mathrm{poly}(ε))$ queries for such loss functions, including the Tukey and Huber losses, answering another question of [CD21]. For the Huber loss, we further improve our bound to $\tilde O(d^{4-2\sqrt2}/\mathrm{poly}(ε))$ samples. Our sensitivity bounds also have many applications, including Orlicz norm subspace embeddings, robust subspace approximation, and dimension reduction for smoothed $p$-norms. Finally, our active sampling results give the first sublinear time algorithms for Kronecker product regression under every $p$ norm.
DSMay 15, 2019
Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel $k$-means ClusteringManuel Fernandez, David P. Woodruff, Taisuke Yasuda
We present tight lower bounds on the number of kernel evaluations required to approximately solve kernel ridge regression (KRR) and kernel $k$-means clustering (KKMC) on $n$ input points. For KRR, our bound for relative error approximation to the minimizer of the objective function is $Ω(nd_{\mathrm{eff}}^λ/\varepsilon)$ where $d_{\mathrm{eff}}^λ$ is the effective statistical dimension, which is tight up to a $\log(d_{\mathrm{eff}}^λ/\varepsilon)$ factor. For KKMC, our bound for finding a $k$-clustering achieving a relative error approximation of the objective function is $Ω(nk/\varepsilon)$, which is tight up to a $\log(k/\varepsilon)$ factor. Our KRR result resolves a variant of an open question of El Alaoui and Mahoney, asking whether the effective statistical dimension is a lower bound on the sampling complexity or not. Furthermore, for the important practical case when the input is a mixture of Gaussians, we provide a KKMC algorithm which bypasses the above lower bound.