Roman Vershynin

NA
h-index4
27papers
7,224citations
Novelty61%
AI Score49

27 Papers

PRNov 23, 2011
Introduction to the non-asymptotic analysis of random matrices

Roman Vershynin

This is a tutorial on some basic non-asymptotic methods and concepts in random matrix theory. The reader will learn several tools for the analysis of the extreme singular values of random matrices with independent rows or columns. Many of these methods sprung off from the development of geometric functional analysis since the 1970's. They have applications in several fields, most notably in theoretical computer science, statistics and signal processing. A few basic applications are covered in this text, particularly for the problem of estimating covariance matrices in statistics and for validating probabilistic constructions of measurement matrices in compressed sensing. These notes are written particularly for graduate students and beginning researchers in different areas, including functional analysts, probabilists, theoretical statisticians, electrical engineers, and theoretical computer scientists.

NADec 9, 2007
Signal Recovery from Incomplete and Inaccurate Measurements via Regularized Orthogonal Matching Pursuit

Deanna Needell, Roman Vershynin

We demonstrate a simple greedy algorithm that can reliably recover a d-dimensional vector v from incomplete and inaccurate measurements x. Here our measurement matrix is an N by d matrix with N much smaller than d. Our algorithm, Regularized Orthogonal Matching Pursuit (ROMP), seeks to close the gap between two major approaches to sparse recovery. It combines the speed and ease of implementation of the greedy methods with the strong guarantees of the convex programming methods. For any measurement matrix that satisfies a Uniform Uncertainty Principle, ROMP recovers a signal with O(n) nonzeros from its inaccurate measurements x in at most n iterations, where each iteration amounts to solving a Least Squares Problem. The noise level of the recovery is proportional to the norm of the error, up to a log factor. In particular, if the error vanishes the reconstruction is exact. This stability result extends naturally to the very accurate recovery of approximately sparse signals.

66.3PRMay 26
On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

Guangyi Zou, Roman Vershynin

This short note presents a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappings. Discovered by Gemini 3.5 Flash, this result applies to any bounded function under a well-conditioned covariance. We apply this tool to answer a question of Simone Bombari on sign-quantized linear maps $Y = \text{sgn}(Wx)$.

QMMay 31, 2022
AVIDA: Alternating method for Visualizing and Integrating Data

Kathryn Dover, Zixuan Cang, Anna Ma et al.

High-dimensional multimodal data arises in many scientific fields. The integration of multimodal data becomes challenging when there is no known correspondence between the samples and the features of different datasets. To tackle this challenge, we introduce AVIDA, a framework for simultaneously performing data alignment and dimension reduction. In the numerical experiments, Gromov-Wasserstein optimal transport and t-distributed stochastic neighbor embedding are used as the alignment and dimension reduction modules respectively. We show that AVIDA correctly aligns high-dimensional datasets without common features with four synthesized datasets and two real multimodal single-cell datasets. Compared to several existing methods, we demonstrate that AVIDA better preserves structures of individual datasets, especially distinct local structures in the joint low-dimensional visualization, while achieving comparable alignment performance. Such a property is important in multimodal single-cell data analysis as some biological processes are uniquely captured by one of the datasets. In general applications, other methods can be used for the alignment and dimension reduction modules.

NAMar 15, 2008
Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit

Deanna Needell, Roman Vershynin

This paper seeks to bridge the two major algorithmic approaches to sparse signal recovery from an incomplete set of linear measurements -- L_1-minimization methods and iterative methods (Matching Pursuits). We find a simple regularized version of the Orthogonal Matching Pursuit (ROMP) which has advantages of both approaches: the speed and transparency of OMP and the strong uniform guarantees of the L_1-minimization. Our algorithm ROMP reconstructs a sparse signal in a number of iterations linear in the sparsity (in practice even logarithmic), and the reconstruction is exact provided the linear measurements satisfy the Uniform Uncertainty Principle.

CRAug 31, 2024
Differentially Private Synthetic High-dimensional Tabular Stream

Girish Kumar, Thomas Strohmer, Roman Vershynin

While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. Our algorithm satisfies differential privacy for the entire input stream (continual differential privacy) and can be used for high-dimensional tabular data. Furthermore, we show the utility of our method via experiments on real-world datasets. The proposed algorithm builds upon a popular select, measure, fit, and iterate paradigm (used by offline synthetic data generation algorithms) and private counters for streams.

STFeb 12, 2024
Online Differentially Private Synthetic Data Generation

Yiyun He, Roman Vershynin, Yizhe Zhu

We present a polynomial-time algorithm for online differentially private synthetic data generation. For a data stream within the hypercube $[0,1]^d$ and an infinite time horizon, we develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(\log(t)t^{-1/d})$ for $d\geq 2$ and $O(\log^{4.5}(t)t^{-1})$ for $d=1$ in the 1-Wasserstein distance. This result extends the previous work on the continual release model for counting queries to Lipschitz queries. Compared to the offline case, where the entire dataset is available at once, our approach requires only an extra polylog factor in the accuracy bound.

CRMay 2, 2025
LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps

Pedro Abdalla, Roman Vershynin

Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? A widely studied approach to this problem is watermarking. We propose an undetectable and elementary watermarking scheme in the closed setting. Also, in the harder open setting, where the adversary has access to most of the model, we propose an unremovable watermarking scheme.

DBJan 26, 2024
An Algorithm for Streaming Differentially Private Data

Girish Kumar, Thomas Strohmer, Roman Vershynin

Much of the research in differential privacy has focused on offline applications with the assumption that all data is available at once. When these algorithms are applied in practice to streams where data is collected over time, this either violates the privacy guarantees or results in poor utility. We derive an algorithm for differentially private synthetic streaming data generation, especially curated towards spatial datasets. Furthermore, we provide a general framework for online selective counting among a collection of queries which forms a basis for many tasks such as query answering and synthetic data generation. The utility of our algorithm is verified on both real-world and simulated datasets.

LGMay 26, 2023
Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Yiyun He, Thomas Strohmer, Roman Vershynin et al.

Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.

LGFeb 15, 2022
The Quarks of Attention

Pierre Baldi, Roman Vershynin

Attention plays a fundamental role in both natural and artificial intelligence systems. In deep learning, attention-based neural architectures, such as transformer architectures, are widely used to tackle problems in natural language processing and beyond. Here we investigate the fundamental building blocks of attention and their computational properties. Within the standard model of deep learning, we classify all possible fundamental building blocks of attention in terms of their source, target, and computational mechanism. We identify and study three most important mechanisms: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating). The gating mechanisms correspond to multiplicative extensions of the standard model and are used across all current attention-based deep learning architectures. We study their functional properties and estimate the capacity of several attentional building blocks in the case of linear and polynomial threshold gates. Surprisingly, additive activation attention plays a central role in the proofs of the lower bounds. Attention mechanisms reduce the depth of certain basic circuits and leverage the power of quadratic activations without incurring their full cost.

CRSep 30, 2021
Private sampling: a noiseless approach for generating differentially private synthetic data

March Boedihardjo, Thomas Strohmer, Roman Vershynin

In a world where artificial intelligence and data science become omnipresent, data sharing is increasingly locking horns with data-privacy concerns. Differential privacy has emerged as a rigorous framework for protecting individual privacy in a statistical database, while releasing useful statistical information about the database. The standard way to implement differential privacy is to inject a sufficient amount of noise into the data. However, in addition to other limitations of differential privacy, this process of adding noise will affect data accuracy and utility. Another approach to enable privacy in data sharing is based on the concept of synthetic data. The goal of synthetic data is to create an as-realistic-as-possible dataset, one that not only maintains the nuances of the original data, but does so without risk of exposing sensitive information. The combination of differential privacy with synthetic data has been suggested as a best-of-both-worlds solutions. In this work, we propose the first noisefree method to construct differentially private synthetic data; we do this through a mechanism called "private sampling". Using the Boolean cube as benchmark data model, we derive explicit bounds on accuracy and privacy of the constructed synthetic data. The key mathematical tools are hypercontractivity, duality, and empirical processes. A core ingredient of our private sampling mechanism is a rigorous "marginal correction" method, which has the remarkable property that importance reweighting can be utilized to exactly match the marginals of the sample to the marginals of the population.

CRSep 3, 2021
Privacy of synthetic data: a statistical framework

March Boedihardjo, Thomas Strohmer, Roman Vershynin

Privacy-preserving data analysis is emerging as a challenging problem with far-reaching impact. In particular, synthetic data are a promising concept toward solving the aporetic conflict between data privacy and data sharing. Yet, it is known that accurately generating private, synthetic data of certain kinds is NP-hard. We develop a statistical framework for differentially private synthetic data, which enables us to circumvent the computational hardness of the problem. We consider the true data as a random sample drawn from a population Omega according to some unknown density. We then replace Omega by a much smaller random subset Omega^*, which we sample according to some known density. We generate synthetic data on the reduced space Omega^* by fitting the specified linear statistics obtained from the true data. To ensure privacy we use the common Laplacian mechanism. Employing the concept of Renyi condition number, which measures how well the sampling distribution is correlated with the population distribution, we derive explicit bounds on the privacy and accuracy provided by the proposed method.

CRJul 13, 2021
Covariance's Loss is Privacy's Gain: Computationally Efficient, Private and Accurate Synthetic Data

March Boedihardjo, Thomas Strohmer, Roman Vershynin

The protection of private information is of vital importance in data-driven research, business, and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, who have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy. Today, another solution is gaining traction, synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper we focus on the NP-hard challenge to develop a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but a first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question how much information is lost when we take conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy, and synthetic data.

LGFeb 19, 2021
A theory of capacity and sparse neural encoding

Pierre Baldi, Roman Vershynin

Motivated by biological considerations, we study sparse neural maps from an input layer to a target layer with sparse activity, and specifically the problem of storing $K$ input-target associations $(x,y)$, or memories, when the target vectors $y$ are sparse. We mathematically prove that $K$ undergoes a phase transition and that in general, and somewhat paradoxically, sparsity in the target layers increases the storage capacity of the map. The target vectors can be chosen arbitrarily, including in random fashion, and the memories can be both encoded and decoded by networks trained using local learning rules, including the simple Hebb rule. These results are robust under a variety of statistical assumptions on the data. The proofs rely on elegant properties of random polytopes and sub-gaussian random vector variables. Open problems and connections to capacity theories and polynomial threshold maps are discussed.

LGJan 20, 2020
Memory capacity of neural networks with threshold and ReLU activations

Roman Vershynin

Overwhelming theoretical and empirical evidence shows that mildly overparametrized neural networks -- those with more connections than the size of the training data -- are often able to memorize the training data with $100\%$ accuracy. This was rigorously proved for networks with sigmoid activation functions and, very recently, for ReLU activations. Addressing a 1988 open question of Baum, we prove that this phenomenon holds for general multilayered perceptrons, i.e. neural networks with threshold activation functions, or with any mix of threshold and ReLU activations. Our construction is probabilistic and exploits sparsity.

MLOct 28, 2019
Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

Yan Shuo Tan, Roman Vershynin

In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative schemes from a random initialization may also lead to convergence, albeit at the cost of slightly higher sample complexity. In this paper, we prove that, in fact, constant step size online stochastic gradient descent (SGD) converges from arbitrary initializations for the non-smooth, non-convex amplitude squared loss objective. In this setting, online SGD is also equivalent to the randomized Kaczmarz algorithm from numerical analysis. Our analysis can easily be generalized to other single index models. It also makes use of new ideas from stochastic process theory, including the notion of a summary state space, which we believe will be of use for the broader field of non-convex optimization.

LGJan 2, 2019
The capacity of feedforward neural networks

Pierre Baldi, Roman Vershynin

A long standing open problem in the theory of neural networks is the development of quantitative methods to estimate and compare the capabilities of different architectures. Here we define the capacity of an architecture by the binary logarithm of the number of functions it can compute, as the synaptic weights are varied. The capacity provides an upper bound on the number of bits that can be extracted from the training data and stored in the architecture during learning. We study the capacity of layered, fully-connected, architectures of linear threshold neurons with $L$ layers of size $n_1,n_2, \ldots, n_L$ and show that in essence the capacity is given by a cubic polynomial in the layer sizes: $C(n_1,\ldots, n_L)=\sum_{k=1}^{L-1} \min(n_1,\ldots,n_k)n_kn_{k+1}$, where layers that are smaller than all previous layers act as bottlenecks. In proving the main result, we also develop new techniques (multiplexing, enrichment, and stacking) as well as new bounds on the capacity of finite sets. We use the main result to identify architectures with maximal or minimal capacity under a number of natural constraints. This leads to the notion of structural regularization for deep architectures. While in general, everything else being equal, shallow networks compute more functions than deep networks, the functions computed by deep networks are more regular and "interesting".

NAJun 30, 2017
Phase Retrieval via Randomized Kaczmarz: Theoretical Guarantees

Yan Shuo Tan, Roman Vershynin

We consider the problem of phase retrieval, i.e. that of solving systems of quadratic equations. A simple variant of the randomized Kaczmarz method was recently proposed for phase retrieval, and it was shown numerically to have a computational edge over state-of-the-art Wirtinger flow methods. In this paper, we provide the first theoretical guarantee for the convergence of the randomized Kaczmarz method for phase retrieval. We show that it is sufficient to have as many Gaussian measurements as the dimension, up to a constant factor. Along the way, we introduce a sufficient condition on measurement sets for which the randomized Kaczmarz method is guaranteed to work. We show that Gaussian sampling vectors satisfy this property with high probability; this is proved using a chaining argument coupled with bounds on VC dimension and metric entropy.

LGApr 4, 2017
Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

Yan Shuo Tan, Roman Vershynin

The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\mathbb{R}^n$ so that data points projected onto $E$ follow a non-gaussian distribution. Although this is an appropriate model for some real world data analysis problems, there has been little progress on this problem over the last decade. In this paper, we attempt to address this state of affairs in two ways. First, we give a new characterization of standard gaussian distributions in high-dimensions, which lead to effective tests for non-gaussianness. Second, we propose a simple algorithm, \emph{Reweighted PCA}, as a method for solving the NGCA problem. We prove that for a general unknown non-gaussian distribution, this algorithm recovers at least one direction in $E$, with sample and time complexity depending polynomially on the dimension of the ambient space. We conjecture that the algorithm actually recovers the entire $E$.

ITJul 27, 2015
On the Effective Measure of Dimension in the Analysis Cosparse Model

Raja Giryes, Yaniv Plan, Roman Vershynin

Many applications have benefited remarkably from low-dimensional models in the recent decade. The fact that many signals, though high dimensional, are intrinsically low dimensional has given the possibility to recover them stably from a relatively small number of their measurements. For example, in compressed sensing with the standard (synthesis) sparsity prior and in matrix completion, the number of measurements needed is proportional (up to a logarithmic factor) to the signal's manifold dimension. Recently, a new natural low-dimensional signal model has been proposed: the cosparse analysis prior. In the noiseless case, it is possible to recover signals from this model, using a combinatorial search, from a number of measurements proportional to the signal's manifold dimension. However, if we ask for stability to noise or an efficient (polynomial complexity) solver, all the existing results demand a number of measurements which is far removed from the manifold dimension, sometimes far greater. Thus, it is natural to ask whether this gap is a deficiency of the theory and the solvers, or if there exists a real barrier in recovering the cosparse signals by relying only on their manifold dimension. Is there an algorithm which, in the presence of noise, can accurately recover a cosparse signal from a number of measurements proportional to the manifold dimension? In this work, we prove that there is no such algorithm. Further, we show through numerical simulations that even in the noiseless case convex relaxations fail when the number of measurements is comparable to the manifold dimension. This gives a practical counter-example to the growing literature on compressed acquisition of signals based on manifold dimension.

MLMay 31, 2014
Optimization via Low-rank Approximation for Community Detection in Networks

Can M. Le, Elizaveta Levina, Roman Vershynin

Community detection is one of the fundamental problems of network analysis, for which a number of methods have been proposed. Most model-based or criteria-based methods have to solve an optimization problem over a discrete set of labels to find communities, which is computationally infeasible. Some fast spectral algorithms have been proposed for specific methods or models, but only on a case-by-case basis. Here we propose a general approach for maximizing a function of a network adjacency matrix over discrete labels by projecting the set of labels onto a subspace approximating the leading eigenvectors of the expected adjacency matrix. This projection onto a low-dimensional space makes the feasible set of labels much smaller and the optimization problem much easier. We prove a general result about this method and show how to apply it to several previously proposed community detection criteria, establishing its consistency for label estimation in each case and demonstrating the fundamental connection between spectral properties of the network and various model-based approaches to community detection. Simulations and applications to real-world data are included to demonstrate our method performs well for multiple problems over a wide range of parameters.

NAApr 6, 2010
Uncertainty Principles and Vector Quantization

Yurii Lyubarskii, Roman Vershynin

Given a frame in C^n which satisfies a form of the uncertainty principle (as introduced by Candes and Tao), it is shown how to quickly convert the frame representation of every vector into a more robust Kashin's representation whose coefficients all have the smallest possible dynamic range O(1/\sqrt{n}). The information tends to spread evenly among these coefficients. As a consequence, Kashin's representations have a great power for reduction of errors in their coefficients, including coefficient losses and distortions.

NAAug 3, 2009
On the Role of Sparsity in Compressed Sensing and Random Matrix Theory

Roman Vershynin

We discuss applications of some concepts of Compressed Sensing in the recent work on invertibility of random matrices due to Rudelson and the author. We sketch an argument leading to the optimal bound N^{-1/2} on the median of the smallest singular value of an N by N matrix with random independent entries. We highlight the parts of the argument where sparsity ideas played a key role.

NAFeb 8, 2007
A randomized Kaczmarz algorithm with exponential convergence

Thomas Strohmer, Roman Vershynin

The Kaczmarz method for solving linear systems of equations is an iterative algorithm that has found many applications ranging from computer tomography to digital signal processing. Despite the popularity of this method, useful theoretical estimates for its rate of convergence are still scarce. We introduce a randomized version of the Kaczmarz method for consistent, overdetermined linear systems and we prove that it converges with expected exponential rate. Furthermore, this is the first solver whose rate does not depend on the number of equations in the system. The solver does not even need to know the whole system, but only a small random part of it. It thus outperforms all previously known methods on general extremely overdetermined systems. Even for moderately overdetermined systems, numerical simulations as well as theoretical analysis reveal that our algorithm can converge faster than the celebrated conjugate gradient algorithm. Furthermore, our theory and numerical simulations confirm a prediction of Feichtinger et al. in the context of reconstructing bandlimited functions from nonuniform sampling.

FADec 22, 2006
Sampling from large matrices: an approach through geometric functional analysis

Mark Rudelson, Roman Vershynin

We study random submatrices of a large matrix A. We show how to approximately compute A from its random submatrix of the smallest possible size O(r log r) with a small error in the spectral norm, where r = ||A||_F^2 / ||A||_2^2 is the numerical rank of A. The numerical rank is always bounded by, and is a stable relaxation of, the rank of A. This yields an asymptotically optimal guarantee in an algorithm for computing low-rank approximations of A. We also prove asymptotically optimal estimates on the spectral norm and the cut-norm of random submatrices of A. The result for the cut-norm yields a slight improvement on the best known sample complexity for an approximation algorithm for MAX-2CSP problems. We use methods of Probability in Banach spaces, in particular the law of large numbers for operator-valued random variables.

NAMar 2, 2006
Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements

Mark Rudelson, Roman Vershynin

We want to exactly reconstruct a sparse signal f (a vector in R^n of small support) from few linear measurements of f (inner products with some fixed vectors). A nice and intuitive reconstruction by Linear Programming has been advocated since 80-ies by Dave Donoho and his collaborators. Namely, one can relax the reconstruction problem, which is highly nonconvex, to a convex problem -- and, moreover, to a linear program. However, when is exactly the reconstruction problem equivalent to its convex relaxation is an open question. Recent work of many authors shows that the number of measurements k(r,n) needed to exactly reconstruct any r-sparse signal f of length n (a vector in R^n of support r) from its linear measurements with the convex relaxation method is usually O(r polylog(n)). However, known estimates of the number of measurements k(r,n) involve huge constants, in spite of very good performance of the algorithms in practice. In this paper, we consider random Gaussian measurements and random Fourier measurements (a frequency sample of f). For Gaussian measurements, we prove the first guarantees with reasonable constants: k(r,n) < 12 r (2 + log(n/r)), which is optimal up to constants. For Fourier measurements, we prove the best known bound k(r,n) = O(r log(n) . log^2(r) log(r log n)), which is optimal within the log log n and log^3 r factors. Our arguments are based on the technique of Geometric Functional Analysis and Probability in Banach spaces.