ITJul 22, 2011
Recovering Compressively Sampled Signals Using Partial Support InformationMichael P. Friedlander, Hassan Mansour, Rayan Saab et al.
In this paper we study recovery conditions of weighted $\ell_1$ minimization for signal reconstruction from compressed sensing measurements when partial support information is available. We show that if at least 50% of the (partial) support information is accurate, then weighted $\ell_1$ minimization is stable and robust under weaker conditions than the analogous conditions for standard $\ell_1$ minimization. Moreover, weighted $\ell_1$ minimization provides better bounds on the reconstruction error in terms of the measurement noise and the compressibility of the signal to be recovered. We illustrate our results with extensive numerical experiments on synthetic data and real audio and video signals.
74.3LGMay 31
GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank AdaptationShihao Zhang, Rayan Saab
Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.
NADec 6, 2016
Phase Retrieval from Local Measurements: Improved Robustness via Eigenvector-Based Angular SynchronizationMark A. Iwen, Brian Preskitt, Rayan Saab et al.
We improve a phase retrieval approach that uses correlation-based measurements with compactly supported measurement masks [27]. The improved algorithm admits deterministic measurement constructions together with a robust, fast recovery algorithm that consists of solving a system of linear equations in a lifted space, followed by finding an eigenvector (e.g., via an inverse power iteration). Theoretical reconstruction error guarantees from [27] are improved as a result for the new and more robust reconstruction approach proposed herein. Numerical experiments demonstrate robustness and computational efficiency that outperforms competing approaches on large problems. Finally, we show that this approach also trivially extends to phase retrieval problems based on windowed Fourier measurements.
ITFeb 15, 2017
Quantized Compressed Sensing for Partial Random Circulant MatricesJoe-Mei Feng, Felix Krahmer, Rayan Saab
We provide the first analysis of a non-trivial quantization scheme for compressed sensing measurements arising from structured measurements. Specifically, our analysis studies compressed sensing matrices consisting of rows selected at random, without replacement, from a circulant matrix generated by a random subgaussian vector. We quantize the measurements using stable, possibly one-bit, Sigma-Delta schemes, and use a reconstruction method based on convex optimization. We show that the part of the reconstruction error due to quantization decays polynomially in the number of measurements. This is in line with analogous results on Sigma-Delta quantization associated with random Gaussian or subgaussian matrices, and significantly better than results associated with the widely assumed memoryless scalar quantization. Moreover, we prove that our approach is stable and robust; i.e., the reconstruction error degrades gracefully in the presence of non-quantization noise and when the underlying signal is not strictly sparse. The analysis relies on results concerning subgaussian chaos processes as well as a variation of McDiarmid's inequality.
LGSep 7, 2022
A simple approach for quantizing neural networksJohannes Maly, Rayan Saab
In this short note, we propose a new method for quantizing the weights of a fully trained neural network. A simple deterministic pre-processing step allows us to quantize network layers via memoryless scalar quantization while preserving the network performance on given training data. On one hand, the computational complexity of this pre-processing slightly exceeds that of state-of-the-art algorithms in the literature. On the other hand, our approach does not require any hyper-parameter tuning and, in contrast to previous methods, allows a plain analysis. We provide rigorous theoretical guarantees in the case of quantizing single network layers and show that the relative error decays with the number of parameters in the network if the training data behaves well, e.g., if it is sampled from suitable random distributions. The developed method also readily allows the quantization of deep networks by consecutive application to single layers.
LGSep 25, 2024
Accumulator-Aware Post-Training Quantization for Large Language ModelsIan Colbert, Giuseppe Franco, Fabian Grob et al.
When quantizing weights and activations to increasingly narrower representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs via low-precision accumulation improves throughput, power, and area across inference platforms, albeit with an increased risk of overflow. Accumulator-aware quantization research has so far only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE, the first accumulator-aware quantization framework explicitly designed to endow overflow avoidance guarantees to PTQ algorithms. We present theoretical motivation for AXE and demonstrate its flexibility by implementing it on top of two existing algorithms: GPFQ and OPTQ. We design AXE to support multi-stage accumulation, opening the door to full datapath optimization for the first time. We evaluate AXE using recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naive bit width manipulation by up to 15%.
LGSep 20, 2023
SPFQ: A Stochastic Algorithm and Its Error Analysis for Neural Network QuantizationJinjie Zhang, Rayan Saab
Quantization is a widely used compression method that effectively reduces redundancies in over-parameterized neural networks. However, existing quantization techniques for deep neural networks often lack a comprehensive error analysis due to the presence of non-convex loss functions and nonlinear activations. In this paper, we propose a fast stochastic algorithm for quantizing the weights of fully trained neural networks. Our approach leverages a greedy path-following mechanism in combination with a stochastic quantizer. Its computational complexity scales only linearly with the number of weights in the network, thereby enabling the efficient quantization of large networks. Importantly, we establish, for the first time, full-network error bounds, under an infinite alphabet condition and minimal assumptions on the weights and input data. As an application of this result, we prove that when quantizing a multi-layer network having Gaussian weights, the relative square quantization error exhibits a linear decay as the degree of over-parametrization increases. Furthermore, we demonstrate that it is possible to achieve error bounds equivalent to those obtained in the infinite alphabet case, using on the order of a mere $\log\log N$ bits per weight, where $N$ represents the largest number of neurons in a layer.
LGAug 6, 2025
Provable Post-Training Quantization: Theoretical Analysis of OPTQ and QronosHaoyu Zhang, Shihao Zhang, Ian Colbert et al.
Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework-also known as GPTQ-has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.
LGFeb 4, 2025
Theoretical Guarantees for Low-Rank Compression of Deep Neural NetworksShihao Zhang, Rayan Saab
Deep neural networks have achieved state-of-the-art performance across numerous applications, but their high memory and computational demands present significant challenges, particularly in resource-constrained environments. Model compression techniques, such as low-rank approximation, offer a promising solution by reducing the size and complexity of these networks while only minimally sacrificing accuracy. In this paper, we develop an analytical framework for data-driven post-training low-rank compression. We prove three recovery theorems under progressively weaker assumptions about the approximate low-rank structure of activations, modeling deviations via noise. Our results represent a step toward explaining why data-driven low-rank compression methods outperform data-agnostic approaches and towards theoretically grounded compression algorithms that reduce inference costs while maintaining performance.
LGDec 24, 2024
Unified Stochastic Framework for Neural Network Quantization and PruningHaoyu Zhang, Rayan Saab
Quantization and pruning are two essential techniques for compressing neural networks, yet they are often treated independently, with limited theoretical analysis connecting them. This paper introduces a unified framework for post-training quantization and pruning using stochastic path-following algorithms. Our approach builds on the Stochastic Path Following Quantization (SPFQ) method, extending its applicability to pruning and low-bit quantization, including challenging 1-bit regimes. By incorporating a scaling parameter and generalizing the stochastic operator, the proposed method achieves robust error correction and yields rigorous theoretical error bounds for both quantization and pruning as well as their combination.
LGSep 6, 2025
The Measure of Deception: An Analysis of Data Forging in Machine UnlearningRishabh Dixit, Yuan Hui, Rayan Saab
Motivated by privacy regulations and the need to mitigate the effects of harmful data, machine unlearning seeks to modify trained models so that they effectively ``forget'' designated data. A key challenge in verifying unlearning is forging -- adversarially crafting data that mimics the gradient of a target point, thereby creating the appearance of unlearning without actually removing information. To capture this phenomenon, we consider the collection of data points whose gradients approximate a target gradient within tolerance $ε$ -- which we call an $ε$-forging set -- and develop a framework for its analysis. For linear regression and one-layer neural networks, we show that the Lebesgue measure of this set is small. It scales on the order of $ε$, and when $ε$ is small enough, $ε^d$. More generally, under mild regularity assumptions, we prove that the forging set measure decays as $ε^{(d-r)/2}$, where $d$ is the data dimension and $r<d$ is the nullity of a variation matrix defined by the model gradients. Extensions to batch SGD and almost-everywhere smooth loss functions yield the same asymptotic scaling. In addition, we establish probability bounds showing that, under non-degenerate data distributions, the likelihood of randomly sampling a forging point is vanishingly small. These results provide evidence that adversarial forging is fundamentally limited and that false unlearning claims can, in principle, be detected.
LGAug 27, 2025
Beacon: Post-Training Quantization with Integrated Grid SelectionShihao Zhang, Rayan Saab
Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled integer grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. We propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using an unscaled grid and automatically determines the optimal scaling factors by exploiting the geometry of scalar quantization. It does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
SPFeb 9, 2022
Spectrally Adaptive Common Spatial PatternsMahta Mousavi, Eric Lybrand, Shuangquan Feng et al.
The method of Common Spatial Patterns (CSP) is widely used for feature extraction of electroencephalography (EEG) data, such as in motor imagery brain-computer interface (BCI) systems. It is a data-driven method estimating a set of spatial filters so that the power of the filtered EEG signal is maximized for one motor imagery class and minimized for the other. This method, however, is prone to overfitting and is known to suffer from poor generalization especially with limited calibration data. Additionally, due to the high heterogeneity in brain data and the non-stationarity of brain activity, CSP is usually trained for each user separately resulting in long calibration sessions or frequent re-calibrations that are tiring for the user. In this work, we propose a novel algorithm called Spectrally Adaptive Common Spatial Patterns (SACSP) that improves CSP by learning a temporal/spectral filter for each spatial filter so that the spatial filters are concentrated on the most relevant temporal frequencies for each user. We show the efficacy of SACSP in providing better generalizability and higher classification accuracy from calibration to online control compared to existing methods. Furthermore, we show that SACSP provides neurophysiologically relevant information about the temporal frequencies of the filtered signals. Our results highlight the differences in the motor imagery signal among BCI users as well as spectral differences in the signals generated for each class, and show the importance of learning robust user-specific features in a data-driven manner.
LGJan 26, 2022
Post-training Quantization for Neural Networks with Provable GuaranteesJinjie Zhang, Yixuan Zhou, Rayan Saab
While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
LGJun 4, 2021
Sigma-Delta and Distributed Noise-Shaping Quantization Methods for Random Fourier FeaturesJinjie Zhang, Harish Kannan, Alexander Cloninger et al.
We propose the use of low bit-depth Sigma-Delta and distributed noise-shaping methods for quantizing the Random Fourier features (RFFs) associated with shift-invariant kernels. We prove that our quantized RFFs -- even in the case of $1$-bit quantization -- allow a high accuracy approximation of the underlying kernels, and the approximation error decays at least polynomially fast as the dimension of the RFFs increases. We also show that the quantized RFFs can be further compressed, yielding an excellent trade-off between memory use and accuracy. Namely, the approximation error now decays exponentially as a function of the bits used. Moreover, we empirically show by testing the performance of our methods on several machine learning tasks that our method compares favorably to other state of the art quantization methods in this context.
LGOct 29, 2020
A Greedy Algorithm for Quantizing Neural NetworksEric Lybrand, Rayan Saab
We propose a new computationally efficient method for quantizing the weights of pre- trained neural networks that is general enough to handle both multi-layer perceptrons and convolutional neural networks. Our method deterministically quantizes layers in an iterative fashion with no complicated re-training required. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. This simple algorithm is equivalent to running a dynamical system, which we prove is stable for quantizing a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are Gaussian. We show that under these assumptions, the quantization error decays with the width of the layer, i.e., its level of over-parametrization. We provide numerical experiments, on multi-layer networks, to illustrate the performance of our methods on MNIST and CIFAR10 data, as well as for quantizing the VGG16 network using ImageNet data.
ITOct 1, 2020
Faster Binary Embeddings for Preserving Euclidean DistancesJinjie Zhang, Rayan Saab
We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset $\mathcal{T}\subseteq\mathbb{R}^n$ into binary sequences in the cube $\{\pm 1\}^m$. When $\mathcal{T}$ consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to $A x$ where $A\in\mathbb{R}^{m\times n}$ is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use $x\mapsto \mathrm{sign}(Ax)$ for the embedding. Moreover, we show that Euclidean distances among the elements of $\mathcal{T}$ are approximated by the $\ell_1$ norm on the images of $\{\pm 1\}^m$ under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity $O(m)$ and space complexity $O(m)$. Further, we prove that the method is accurate and its associated error is comparable to that of a continuous valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension $m$ increases. Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance.
MLJul 30, 2020
Random Vector Functional Link Networks for Function Approximation on ManifoldsDeanna Needell, Aaron A. Nelson, Rayan Saab et al.
The learning speed of feed-forward neural networks is notoriously slow and has presented a bottleneck in deep learning applications for several decades. For instance, gradient-based learning algorithms, which are used extensively to train neural networks, tend to work slowly when all of the network parameters must be iteratively tuned. To counter this, both researchers and practitioners have tried introducing randomness to reduce the learning requirement. Based on the original construction of Igelnik and Pao, single layer neural-networks with random input-to-hidden layer weights and biases have seen success in practice, but the necessary theoretical justification is lacking. In this paper, we begin to fill this theoretical gap. We provide a (corrected) rigorous proof that the Igelnik and Pao construction is a universal approximator for continuous functions on compact domains, with approximation error decaying asymptotically like $O(1/\sqrt{n})$ for the number $n$ of network nodes. We then extend this result to the non-asymptotic setting, proving that one can achieve any desired approximation error with high probability provided $n$ is sufficiently large. We further adapt this randomized neural network architecture to approximate functions on smooth, compact submanifolds of Euclidean space, providing theoretical guarantees in both the asymptotic and non-asymptotic forms. Finally, we illustrate our results on manifolds with numerical experiments.
NAApr 16, 2019
A direct solver for the phase retrieval problem in ptychographic imagingNada Sissouno, Florian Boßmann, Frank Filbir et al.
Measurements achieved with ptychographic imaging are a special case of diffraction measurements. They are generated by illuminating small parts of a sample with, e.g., a focused X-ray beam. By shifting the sample, a set of far-field diffraction patterns of the whole sample are then obtained. From a mathematical point of view those measurements are the squared modulus of the windowed Fourier transform of the sample. Thus, we have a phase retrieval problem for local Fourier measurements. A direct solver for this problem was introduced by Iwen, Viswanathan and Wang in 2016 and improved by Iwen, Preskitt, Saab and Viswanathan in 2018. Motivated by the applied perspective of ptychographic imaging, we present a generalization of this method and compare the different versions in numerical experiments. The new method proposed herein turns out to be more stable, particularly in the case of missing data.
ITJan 26, 2018
Fast binary embeddings, and quantized compressed sensing with structured matricesThang Huynh, Rayan Saab
This paper deals with two related problems, namely distance-preserving binary embeddings and quantization for compressed sensing . First, we propose fast methods to replace points from a subset $\mathcal{X} \subset \mathbb{R}^n$, associated with the Euclidean metric, with points in the cube $\{\pm 1\}^m$ and we associate the cube with a pseudo-metric that approximates Euclidean distance among points in $\mathcal{X}$. Our methods rely on quantizing fast Johnson-Lindenstrauss embeddings based on bounded orthonormal systems and partial circulant ensembles, both of which admit fast transforms. Our quantization methods utilize noise-shaping, and include Sigma-Delta schemes and distributed noise-shaping schemes. The resulting approximation errors decay polynomially and exponentially fast in $m$, depending on the embedding method. This dramatically outperforms the current decay rates associated with binary embeddings and Hamming distances. Additionally, it is the first such binary embedding result that applies to fast Johnson-Lindenstrauss maps while preserving $\ell_2$ norms. Second, we again consider noise-shaping schemes, albeit this time to quantize compressed sensing measurements arising from bounded orthonormal ensembles and partial circulant matrices. We show that these methods yield a reconstruction error that again decays with the number of measurements (and bits), when using convex optimization for reconstruction. Specifically, for Sigma-Delta schemes, the error decays polynomially in the number of measurements, and it decays exponentially for distributed noise-shaping schemes based on beta encoding. These results are near optimal and the first of their kind dealing with bounded orthonormal systems.
LGJul 6, 2017
Simple Classification using Binary DataDeanna Needell, Rayan Saab, Tina Woolf
Binary, or one-bit, representations of data arise naturally in many applications, and are appealing in both hardware implementations and algorithm design. In this work, we study the problem of data classification from binary data and propose a framework with low computation and resource costs. We illustrate the utility of the proposed approach through stylized and realistic numerical experiments, and provide a theoretical analysis for a simple case. We hope that our framework and analysis will serve as a foundation for studying similar types of approaches.
MLApr 28, 2014
One-bit compressive sensing with norm estimationKarin Knudson, Rayan Saab, Rachel Ward
Consider the recovery of an unknown signal ${x}$ from quantized linear measurements. In the one-bit compressive sensing setting, one typically assumes that ${x}$ is sparse, and that the measurements are of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle) \in \{\pm1\}$. Since such measurements give no information on the norm of ${x}$, recovery methods from such measurements typically assume that $\| {x} \|_2=1$. We show that if one allows more generally for quantized affine measurements of the form $\operatorname{sign}(\langle {a}_i, {x} \rangle + b_i)$, and if the vectors ${a}_i$ are random, an appropriate choice of the affine shifts $b_i$ allows norm recovery to be easily incorporated into existing methods for one-bit compressive sensing. Additionally, we show that for arbitrary fixed ${x}$ in the annulus $r \leq \| {x} \|_2 \leq R$, one may estimate the norm $\| {x} \|_2$ up to additive error $δ$ from $m \gtrsim R^4 r^{-2} δ^{-2}$ such binary measurements through a single evaluation of the inverse Gaussian error function. Finally, all of our recovery guarantees can be made universal over sparse vectors, in the sense that with high probability, one set of measurements and thresholds can successfully estimate all sparse vectors ${x}$ within a Euclidean ball of known radius.