Danil Akhtiamov

LG
h-index5
9papers
21citations
Novelty40%
AI Score44

9 Papers

LGFeb 18, 2023
The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models

Danil Akhtiamov, Babak Hassibi

Despite being highly over-parametrized, and having the ability to fully interpolate the training data, deep networks are known to generalize well to unseen data. It is now understood that part of the reason for this is that the training algorithms used have certain implicit regularization properties that ensure interpolating solutions with "good" properties are found. This is best understood in linear over-parametrized models where it has been shown that the celebrated stochastic gradient descent (SGD) algorithm finds an interpolating solution that is closest in Euclidean distance to the initial weight vector. Different regularizers, replacing Euclidean distance with Bregman divergence, can be obtained if we replace SGD with stochastic mirror descent (SMD). Empirical observations have shown that in the deep network setting, SMD achieves a generalization performance that is different from that of SGD (and which depends on the choice of SMD's potential function. In an attempt to begin to understand this behavior, we obtain the generalization error of SMD for over-parametrized linear models for a binary classification problem where the two classes are drawn from a Gaussian mixture model. We present simulation results that validate the theory and, in particular, introduce two data models, one for which SMD with an $\ell_2$ regularizer (i.e., SGD) outperforms SMD with an $\ell_1$ regularizer, and one for which the reverse happens.

MLMar 11
Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime

Reza Ghane, Danil Akhtiamov, Babak Hassibi

In this work we study the convergence properties of the Dual Space Preconditioned Gradient Descent, encompassing optimizers such as Normalized Gradient Descent, Gradient Clipping and Adam. We consider preconditioners of the form $\nabla K$, where $K: \mathbb{R}^p \to \mathbb{R}$ is convex and assume that the latter is applied to train an over-parameterized linear model with loss of the form $\ell({X} {W} - {Y})$, for weights ${W} \in \mathbb{R}^{d \times k}$, labels ${Y} \in \mathbb{R}^{n \times k}$ and data ${X} \in \mathbb{R}^{n \times d}$. Under the aforementioned assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point ${W}_{\infty} \in \mathbb{R}^{d \times k}$ satisfying ${X}{W}_{\infty} = {Y}$. Our proof techniques are of independent interest as we introduce a novel version of the Bregman Divergence with accompanying identities that allow us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general $K(\cdot)$, ${W}_\infty$ depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form $K({G}) = h(\|{G}\|_F)$, known as \textit{isotropic preconditioners}, we show that ${W}_\infty$ minimizes $\|{W}_\infty - {W}_0\|_F^2$ subject to ${X}{W}_\infty = {Y}$, where ${W}_0$ is the initialization. Denoting the convergence point of GD initialized at ${W}_0$ by ${W}_{\text{GD}, \infty}$, we thus note ${W}_{\infty} = {W}_{\text{GD}, \infty}$ for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, $\|{W}_0 - {W}_{\infty}\|_F \le c \|{W}_0 - {W}_{\text{GD}, \infty}\|_F$ for a constant $c>0$.

LGNov 3, 2023
Regularized Linear Regression for Binary Classification

Danil Akhtiamov, Reza Ghane, Babak Hassibi

Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.

MLFeb 22
Implicit Bias and Convergence of Matrix Stochastic Mirror Descent

Danil Akhtiamov, Reza Ghane, Babak Hassibi

We investigate Stochastic Mirror Descent (SMD) with matrix parameters and vector-valued predictions, a framework relevant to multi-class classification and matrix completion problems. Focusing on the overparameterized regime, where the total number of parameters exceeds the number of training samples, we prove that SMD with matrix mirror functions $ψ(\cdot)$ converges exponentially to a global interpolator. Furthermore, we generalize classical implicit bias results of vector SMD by demonstrating that the matrix SMD algorithm converges to the unique solution minimizing the Bregman divergence induced by $ψ(\cdot)$ from initialization subject to interpolating the data. These findings reveal how matrix mirror maps dictate inductive bias in high-dimensional, multi-output problems.

MLMar 19
Precise Performance of Linear Denoisers in the Proportional Regime

Reza Ghane, Danil Akhtiamov, Babak Hassibi

In the present paper we study the performance of linear denoisers for noisy data of the form $\mathbf{x} + \mathbf{z}$, where $\mathbf{x} \in \mathbb{R}^d$ is the desired data with zero mean and unknown covariance $\mathbfΣ$, and $\mathbf{z} \sim \mathcal{N}(0, \mathbfΣ_{\mathbf{z}})$ is additive noise. Since the covariance $\mathbfΣ$ is not known, the standard Wiener filter cannot be employed for denoising. Instead we assume we are given samples $\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^d$ from the true distribution. A standard approach would then be to estimate $\mathbfΣ$ from the samples and use it to construct an ``empirical" Wiener filter. However, in this paper, motivated by the denoising step in diffusion models, we take a different approach whereby we train a linear denoiser $\mathbf{W}$ from the data itself. In particular, we synthetically construct noisy samples $\hat{\mathbf{x}}_i$ of the data by injecting the samples with Gaussian noise with covariance $\mathbfΣ_1 \neq \mathbfΣ_{\mathbf{z}}$ and find the best $\mathbf{W}$ that approximates $\mathbf{W}\hat{\mathbf{x}}_i \approx \mathbf{x}_i$ in a least-squares sense. In the proportional regime $\frac{n}{d} \rightarrow κ> 1$ we use the {\it Convex Gaussian Min-Max Theorem (CGMT)} to analytically find the closed form expression for the generalization error of the denoiser obtained from this process. Using this expression one can optimize over $\mathbfΣ_1$ to find the best possible denoiser. Our numerical simulations show that our denoiser outperforms the ``empirical" Wiener filter in many scenarios and approaches the optimal Wiener filter as $κ\rightarrow\infty$.

LGFeb 12, 2024
A Novel Gaussian Min-Max Theorem and its Applications

Danil Akhtiamov, David Bosch, Reza Ghane et al.

A celebrated result by Gordon allows one to compare the min-max behavior of two Gaussian processes if certain inequality conditions are met. The consequences of this result include the Gaussian min-max (GMT) and convex Gaussian min-max (CGMT) theorems which have had far-reaching implications in high-dimensional statistics, machine learning, non-smooth optimization, and signal processing. Both theorems rely on a pair of Gaussian processes, first identified by Slepian, that satisfy Gordon's comparison inequalities. In this paper, we identify such a new pair. The resulting theorems extend the classical GMT and CGMT Theorems from the case where the underlying Gaussian matrix in the primary process has iid rows to where it has independent but non-identically-distributed ones. The new CGMT is applied to the problems of multi-source Gaussian regression, as well as to binary classification of general Gaussian mixture models.

LGFeb 16, 2024
One-Bit Quantization and Sparsification for Multiclass Linear Classification with Strong Regularization

Reza Ghane, Danil Akhtiamov, Babak Hassibi

We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $λf(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $λ\to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large $λ$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.

LGOct 17, 2025
One-Bit Quantization for Random Features Models

Danil Akhtiamov, Reza Ghane, Babak Hassibi

Recent advances in neural networks have led to significant computational and memory demands, spurring interest in one-bit weight compression to enable efficient inference on resource-constrained devices. However, the theoretical underpinnings of such compression remain poorly understood. We address this gap by analyzing one-bit quantization in the Random Features model, a simplified framework that corresponds to neural networks with random representations. We prove that, asymptotically, quantizing weights of all layers except the last incurs no loss in generalization error, compared to the full precision random features model. Our findings offer theoretical insights into neural network compression. We also demonstrate empirically that one-bit quantization leads to significant inference speed ups for the Random Features models even on a laptop GPU, confirming the practical benefits of our work. Additionally, we provide an asymptotically precise characterization of the generalization error for Random Features with an arbitrary number of layers. To the best of our knowledge, our analysis yields more general results than all previous works in the related literature.

MLJan 13, 2025
Gaussian Universality for Diffusion Models

Reza Ghane, Anthony Bao, Danil Akhtiamov et al.

We investigate Gaussian Universality for data distributions generated via diffusion models. By Gaussian Universality we mean that the test error of a generalized linear model $f(\mathbf{W})$ trained for a classification task on the diffusion data matches the test error of $f(\mathbf{W})$ trained on the Gaussian Mixture with matching means and covariances per class.In other words, the test error depends only on the first and second order statistics of the diffusion-generated data in the linear setting. As a corollary, the analysis of the test error for linear classifiers can be reduced to Gaussian data from diffusion-generated data. Analysing the performance of models trained on synthetic data is a pertinent problem due to the surge of methods such as \cite{sehwag2024stretchingdollardiffusiontraining}. Moreover, we show that, for any $1$- Lipschitz scalar function $φ$, $φ(\mathbf{x})$ is close to $\mathbb{E} φ(\mathbf{x})$ with high probability for $\mathbf{x}$ sampled from the conditional diffusion model corresponding to each class. Finally, we note that current approaches for proving universality do not apply to diffusion-generated data as the covariance matrices of the data tend to have vanishing minimum singular values, contrary to the assumption made in the literature. This leaves extending previous mathematical universality results as an intriguing open question.