Alexander Munteanu

DS
h-index10
17papers
231citations
Novelty58%
AI Score53

17 Papers

LGJun 26, 2022
Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis

Alexander Munteanu, Simon Omlor, Zhao Song et al.

A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors. We observe that by instead initializing the weights into independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $γ^{-8}$ [Ji and Telgarsky, ICLR 2020] to $γ^{-2}$, where $γ$ denotes the separation margin with a Neural Tangent Kernel, as well as in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting we also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.

DSMar 31, 2023
Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic Regression

Alexander Munteanu, Simon Omlor, David Woodruff

We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression, giving a much smaller sketching dimension achieving $O(1)$-approximation and yielding an efficient optimization problem in the sketch space. Namely, we achieve for any constant $c>0$ a sketching dimension of $\tilde{O}(d^{1+c})$ for $\ell_1$ regression and $\tilde{O}(μd^{1+c})$ for logistic regression, where $μ$ is a standard measure that captures the complexity of compressing the data. For $\ell_1$-regression our sketching dimension is near-linear and improves previous work which either required $Ω(\log d)$-approximation with this sketching dimension, or required a larger $\operatorname{poly}(d)$ number of rows. Similarly, for logistic regression previous work had worse $\operatorname{poly}(μd)$ factors in its sketching dimension. We also give a tradeoff that yields a $1+\varepsilon$ approximation in input sparsity time by increasing the total size to $(d\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for $\ell_1$ and to $(μd\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for logistic regression. Finally, we show that our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.

DSApr 5, 2023
Optimal Sketching Bounds for Sparse Linear Regression

Tung Mai, Alexander Munteanu, Cameron Musco et al.

We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $Θ(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression. For this problem, under the $\ell_2$ norm, we observe an upper bound of $O(k \log (d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $o(d)$ rows showing that $O(μ^2 k\log(μn d/\varepsilon)/\varepsilon^2)$ rows suffice, where $μ$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower order terms and the dependence on $μ$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, where one aims to minimize $\|Ax-b\|_2^2+λ\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $O(\log(d)/(λ\varepsilon)^2)$ suffices and that the dependence on $d$ and $λ$ is tight.

LGMay 22
Optimal Dimension-Free Sampling for Regularized Classification

Meysam Alishahi, Alexander Munteanu, Simon Omlor et al.

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and hereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.

DSMar 25, 2022
$p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets

Alexander Munteanu, Simon Omlor, Christian Peters

We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses. It extends the standard probit model by replacing its link function, the standard normal cdf, by a $p$-generalized normal distribution for $p\in[1, \infty)$. The $p$-generalized normal distributions \citep{Sub23} are of special interest in statistical modeling because they fit much more flexibly to data. Their tail behavior can be controlled by choice of the parameter $p$, which influences the model's sensitivity to outliers. Special cases include the Laplace, the Gaussian, and the uniform distributions. We further show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data by combining sketching techniques with importance subsampling to obtain a small data summary called coreset.

LGMar 20
Scalable Learning of Multivariate Distributions via Coresets

Zeyu Ding, Katja Ickstadt, Nadja Klein et al.

Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1\pm\varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.

CGMar 19
Hardness of High-Dimensional Linear Classification

Alexander Munteanu, Simon Omlor, Jeff M. Phillips

We establish new exponential in dimension lower bounds for the Maximum Halfspace Discrepancy problem, which models linear classification. Both are fundamental problems in computational geometry and machine learning in their exact and approximate forms. However, only $O(n^d)$ and respectively $\tilde O(1/\varepsilon^d)$ upper bounds are known and complemented by polynomial lower bounds that do not support the exponential in dimension dependence. We close this gap up to polylogarithmic terms by reduction from widely-believed hardness conjectures for Affine Degeneracy testing and $k$-Sum problems. Our reductions yield matching lower bounds of $\tildeΩ(n^d)$ and respectively $\tildeΩ(1/\varepsilon^d)$ based on Affine Degeneracy testing, and $\tildeΩ(n^{d/2})$ and respectively $\tildeΩ(1/\varepsilon^{d/2})$ conditioned on $k$-Sum. The first bound also holds unconditionally if the computational model is restricted to make sidedness queries, which corresponds to a widely spread setting implemented and optimized in many contemporary algorithms and computing paradigms.

LGMar 1, 2024
Scalable Learning of Item Response Theory Models

Susanne Frick, Amer Krivošija, Alexander Munteanu

Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.

DSMay 29, 2025
Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves

Amer Krivošija, Alexander Munteanu, André Nusser et al.

This paper introduces $k$-Dynamic Time Warping ($k$-DTW), a novel dissimilarity measure for polygonal curves. $k$-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of $k$-DTW and give an exact algorithm as well as a $(1+\varepsilon)$-approximation algorithm for $k$-DTW by a parametric search for the $k$-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. $k$-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves' complexity $m$ are replaced by $k$, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: $k$-DTW admits strictly smaller bounds than DTW, by a factor $\tildeΩ(\sqrt{m})$ when $k\ll m$. We complement our theoretical findings with an experimental illustration of the benefits of using $k$-DTW for clustering and nearest neighbor classification.

LGOct 30, 2024
Data subsampling for Poisson regression with pth-root-link

Han Cheng Lie, Alexander Munteanu

We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y\in\mathbb{N}$. In particular, we consider the Poisson generalized linear model with ID- and square root-link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1\pm\varepsilon$. We show $Ω(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with $1\pm\varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square root-link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count presented in the data, while the ID-link requires a $Θ(\sqrt{y_{\max}/\log(y_{\max})})$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch of the Lambert $W_0$ function, which may be of independent interest. We further show the limitations of our analysis when $p$th degree root-link functions for $p\geq 3$ are considered, which indicate that other analytical or computational methods would be required if such a generalization is even possible.

DSJun 1, 2024
Turnstile $\ell_p$ leverage score sampling with applications

Alexander Munteanu, Simon Omlor

The turnstile data stream model offers the most flexible framework where data can be manipulated dynamically, i.e., rows, columns, and even single entries of an input matrix can be added, deleted, or updated multiple times in a data stream. We develop a novel algorithm for sampling rows $a_i$ of a matrix $A\in\mathbb{R}^{n\times d}$, proportional to their $\ell_p$ norm, when $A$ is presented in a turnstile data stream. Our algorithm not only returns the set of sampled row indexes, it also returns slightly perturbed rows $\tilde{a}_i \approx a_i$, and approximates their sampling probabilities up to $\varepsilon$ relative error. When combined with preconditioning techniques, our algorithm extends to $\ell_p$ leverage score sampling over turnstile data streams. With these properties in place, it allows us to simulate subsampling constructions of coresets for important regression problems to operate over turnstile data streams with very little overhead compared to their respective off-line subsampling algorithms. For logistic regression, our framework yields the first algorithm that achieves a $(1+\varepsilon)$ approximation and works in a turnstile data stream using polynomial sketch/subsample size, improving over $O(1)$ approximations, or $\exp(1/\varepsilon)$ sketch size of previous work. We compare experimentally to plain oblivious sketching and plain leverage score sampling algorithms for $\ell_p$ and logistic regression.

DSJun 1, 2024
Optimal bounds for $\ell_p$ sensitivity sampling via $\ell_2$ augmentation

Alexander Munteanu, Simon Omlor

Data subsampling is one of the most natural methods to approximate a massively large data set by a small representative proxy. In particular, sensitivity sampling received a lot of attention, which samples points proportional to an individual importance measure called sensitivity. This framework reduces in very general settings the size of data to roughly the VC dimension $d$ times the total sensitivity $\mathfrak S$ while providing strong $(1\pm\varepsilon)$ guarantees on the quality of approximation. The recent work of Woodruff & Yasuda (2023c) improved substantially over the general $\tilde O(\varepsilon^{-2}\mathfrak Sd)$ bound for the important problem of $\ell_p$ subspace embeddings to $\tilde O(\varepsilon^{-2}\mathfrak S^{2/p})$ for $p\in[1,2]$. Their result was subsumed by an earlier $\tilde O(\varepsilon^{-2}\mathfrak Sd^{1-p/2})$ bound which was implicitly given in the work of Chen & Derezinski (2021). We show that their result is tight when sampling according to plain $\ell_p$ sensitivities. We observe that by augmenting the $\ell_p$ sensitivities by $\ell_2$ sensitivities, we obtain better bounds improving over the aforementioned results to optimal linear $\tilde O(\varepsilon^{-2}(\mathfrak S+d)) = \tilde O(\varepsilon^{-2}d)$ sampling complexity for all $p \in [1,2]$. In particular, this resolves an open question of Woodruff & Yasuda (2023c) in the affirmative for $p \in [1,2]$ and brings sensitivity subsampling into the regime that was previously only known to be possible using Lewis weights (Cohen & Peng, 2015). As an application of our main result, we also obtain an $\tilde O(\varepsilon^{-2}μd)$ sensitivity sampling bound for logistic regression, where $μ$ is a natural complexity measure for this problem. This improves over the previous $\tilde O(\varepsilon^{-2}μ^2 d)$ bound of Mai et al. (2021) which was based on Lewis weights subsampling.

DSJul 14, 2021
Oblivious sketching for logistic regression

Alexander Munteanu, Simon Omlor, David Woodruff

What guarantees are possible for solving logistic regression in one pass over a data stream? To answer this question, we present the first data oblivious sketch for logistic regression. Our sketch can be computed in input sparsity time over a turnstile data stream and reduces the size of a $d$-dimensional data set from $n$ to only $\operatorname{poly}(μd\log n)$ weighted points, where $μ$ is a useful parameter which captures the complexity of compressing the data. Solving (weighted) logistic regression on the sketch gives an $O(\log n)$-approximation to the original problem on the full data set. We also show how to obtain an $O(1)$-approximation with slight modifications. Our sketches are fast, simple, easy to implement, and our experiments demonstrate their practicality.

LGJul 16, 2019
Random Projections and Sampling Algorithms for Clustering of High-Dimensional Polygonal Curves

Stefan Meintrup, Alexander Munteanu, Dennis Rohde

We study the $k$-median clustering problem for high-dimensional polygonal curves with finite but unbounded number of vertices. We tackle the computational issue that arises from the high number of dimensions by defining a Johnson-Lindenstrauss projection for polygonal curves. We analyze the resulting error in terms of the Fréchet distance, which is a tractable and natural dissimilarity measure for curves. Our clustering algorithms achieve sublinear dependency on the number of input curves via subsampling. Also, we show that the Fréchet distance can not be approximated within any factor of less than $\sqrt{2}$ by probabilistically reducing the dependency on the number of vertices of the curves. As a consequence we provide a fast, CUDA-parallelized version of the Alt and Godau algorithm for computing the Fréchet distance and use it to evaluate our results empirically.

CGFeb 28, 2019
Probabilistic smallest enclosing ball in high dimensions via subgradient sampling

Amer Krivošija, Alexander Munteanu

We study a variant of the median problem for a collection of point sets in high dimensions. This generalizes the geometric median as well as the (probabilistic) smallest enclosing ball (pSEB) problems. Our main objective and motivation is to improve the previously best algorithm for the pSEB problem by reducing its exponential dependence on the dimension to linear. This is achieved via a novel combination of sampling techniques for clustering problems in metric spaces with the framework of stochastic subgradient descent. As a result, the algorithm becomes applicable to shape fitting problems in Hilbert spaces of unbounded dimension via kernel functions. We present an exemplary application by extending the support vector data description (SVDD) shape fitting method to the probabilistic case. This is done by simulating the pSEB algorithm implicitly in the feature space induced by the kernel function.

DSMay 22, 2018
On Coresets for Logistic Regression

Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler et al.

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies the hardness of compressing a data set for logistic regression. $μ(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $μ(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

AIOct 9, 2017
Coresets for Dependency Networks

Alejandro Molina, Alexander Munteanu, Kristian Kersting

Many applications infer the structure of a probabilistic graphical model from data to elucidate the relationships between variables. But how can we train graphical models on a massive data set? In this paper, we show how to construct coresets -compressed data sets which can be used as proxy for the original data and have provably bounded worst case error- for Gaussian dependency networks (DNs), i.e., cyclic directed graphical models over Gaussians, where the parents of each variable are its Markov blanket. Specifically, we prove that Gaussian DNs admit coresets of size independent of the size of the data set. Unfortunately, this does not extend to DNs over members of the exponential family in general. As we will prove, Poisson DNs do not admit small coresets. Despite this worst-case result, we will provide an argument why our coreset construction for DNs can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core DNs on real data sets. The results