LGOct 28, 2023
High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed NoiseAleksandar Armacki, Pranay Sharma, Gauri Joshi et al. · cmu
We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{4}}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{2}}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-ζ} \right)$, where $ζ\in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $ζ$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art, who only consider clipping, require bounded noise moments of order $η\in (1,2]$, and establish convergence rates whose exponents go to zero as $η\rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has finite first moment only. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.
LGSep 22, 2022
A One-shot Framework for Distributed Clustered Learning in Heterogeneous EnvironmentsAleksandar Armacki, Dragana Bajovic, Dusan Jakovetic et al. · cmu
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different distributions. In the proposed setup, the grouping of users (based on the data distributions they sample), as well as the underlying statistical properties of the distributions, are apriori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of problem parameters. The trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) are discussed and significant improvements over state-of-the-art are demonstrated. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods.
OCApr 6, 2022
Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noiseDusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu et al.
We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assuming a strongly convex cost function with Lipschitz continuous gradients under very general assumptions on the gradient noise. Most notably, we show that, for a nonlinearity with bounded outputs and for the gradient noise that may not have finite moments of order greater than one, the nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost function's optimality gap, converges to zero at rate~$O(1/t^ζ)$, $ζ\in (0,1)$. In contrast, for the same noise setting, the linear SGD generates a sequence with unbounded variances. Furthermore, for the nonlinearities that can be decoupled component wise, like, e.g., sign gradient or component-wise clipping, we show that the nonlinear SGD asymptotically (locally) achieves a $O(1/t)$ rate in the weak convergence sense and explicitly quantify the corresponding asymptotic variance. Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state of the art alternatives on real data sets with heavy tail noises.
LGNov 2, 2022
Large deviations rates for stochastic gradient descent with strongly convex functionsDragana Bajovic, Dusan Jakovetic, Soummya Kar
Recent works have shown that high probability metrics with stochastic gradient descent (SGD) exhibit informativeness and in some cases advantage over the commonly adopted mean-square error-based ones. In this work we provide a formal framework for the study of general high probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not-necessarily bounded) gradient noise satisfying mild technical assumptions, allowing for the dependence of the noise distribution on the current iterate. Under the preceding assumptions, we find an upper large deviations bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis that captures only the noise dependence through the variance and does not capture the effect of higher order moments nor interplay between the noise geometry and the shape of the cost function. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained function matches the one from the general upper bound hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate theoretical findings.
31.7ITMar 21
Tackling heavy-tailed noise in distributed estimation: Asymptotic performance and tradeoffsDragana Bajovic, Dusan Jakovetic, Soummya Kar et al.
We present an algorithm for distributed estimation of an unknown vector parameter $\boldsymbolθ^\ast \in {\mathbb R}^M$ in the presence of heavy-tailed observation and communication noises. Heavy-tailed noises frequently appear, e.g., in densely deployed Internet of Things (IoT) or wireless sensor network systems. The presented algorithm falls within the class of \emph{consensus+innovation} estimators and combats the effect of the heavy-tailed noises by adding general nonlinearities in the consensus and innovations update parts. We present results on almost sure convergence and asymptotic normality of the estimator. In addition, we provide novel analytical studies that reveal interesting tradeoffs between the system noises and the underlying network topology.
95.3OCApr 16
Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal ConvergenceShuhua Yu, Dusan Jakovetic, Soummya Kar
Heavy-tailed noise in nonconvex stochastic optimization has garnered increasing research interest, as empirical studies, including those on training attention models, suggest it is a more realistic gradient noise condition. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can only communicate with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and has only $p$-th moment for $p \in (1, 2]$. We propose GT-NSGDm, Gradient Tracking based Normalized Stochastic Gradient Descent with momentum, that utilizes normalization, in conjunction with gradient tracking and momentum, to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits primitive and doubly stochastic weights, GT-NSGDm guarantees, for the \textit{first} time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate $O\big(1/T^{(p-1)/(3p-2)}\big)$, which matches the lower bound in the centralized setup. When tail index $p$ is unknown, GT-NSGDm attains a non-asymptotic rate $O\big( 1/T^{(p-1)/(2p)} \big)$ that is, for $p < 2$, topology independent and has a speedup factor $n^{1-1/p}$ in terms of the number of nodes $n$. Finally, experiments on nonconvex linear regression with tokenized synthetic data and decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than baselines.
LGSep 17, 2021Code
Accurate, Interpretable, and Fast Animation: An Iterative, Sparse, and Nonconvex ApproachStevo Rackovic, Claudia Soares, Dusan Jakovetic et al.
Digital human animation relies on high-quality 3D models of the human face: rigs. A face rig must be accurate and, at the same time, fast to compute. One of the most common rigging models is the blendshape model. We propose a novel algorithm for solving the nonconvex inverse rig problem in facial animation. Our approach is model-based, but in contrast with previous model-based approaches, we use a quadratic instead of the linear approximation to the higher order rig model. This increases the accuracy of the solution by 8 percent on average and, confirmed by the empirical results, increases the sparsity of the resulting parameter vector -- an important feature for interpretability by animation artists. The proposed solution is based on a Levenberg-Marquardt (LM) algorithm, applied to a nonconvex constrained problem with sparsity regularization. In order to reduce the complexity of the iterates, a paradigm of Majorization Minimization (MM) is further invoked, which leads to an easy to solve problem that is separable in the parameters at each algorithm iteration. The algorithm is evaluated on a number of animation datasets, proprietary and open-source, and the results indicate the superiority of our method compared to the standard approach based on the linear rig approximation. Although our algorithm targets the specific problem, it might have additional signal processing applications.
59.9OCMay 8
Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient samplingManojlo Vukovic, Dusan Jakovetic
We consider a first order stochastic optimization framework where, at each iteration, $K$ independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the $K$-sized data batch into $M$ distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate $\mathcal{O}(T^{-1})$ to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm's parameters. Moreover, if the time horizon is known in advance, we establish the rate of $\mathcal{O}(T^{-\frac{1}{2}}).$ Furthermore, when clipping is incorporated, we obtain convergence guaranties in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.
LGOct 17, 2024
Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability GuaranteesAleksandar Armacki, Shuhua Yu, Pranay Sharma et al.
We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-ζ})$, where $ζ\in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art, who only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
LGAug 25, 2025
FedGreed: A Byzantine-Robust Loss-Based Aggregation Method for Federated LearningEmmanouil Kritharakis, Antonios Makris, Dusan Jakovetic et al.
Federated Learning (FL) enables collaborative model training across multiple clients while preserving data privacy by keeping local datasets on-device. In this work, we address FL settings where clients may behave adversarially, exhibiting Byzantine attacks, while the central server is trusted and equipped with a reference dataset. We propose FedGreed, a resilient aggregation strategy for federated learning that does not require any assumptions about the fraction of adversarial participants. FedGreed orders clients' local model updates based on their loss metrics evaluated against a trusted dataset on the server and greedily selects a subset of clients whose models exhibit the minimal evaluation loss. Unlike many existing approaches, our method is designed to operate reliably under heterogeneous (non-IID) data distributions, which are prevalent in real-world deployments. FedGreed exhibits convergence guarantees and bounded optimality gaps under strong adversarial behavior. Experimental evaluations on MNIST, FMNIST, and CIFAR-10 demonstrate that our method significantly outperforms standard and robust federated learning baselines, such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum, in the majority of adversarial scenarios considered, including label flipping and Gaussian noise injection attacks. All experiments were conducted using the Flower federated learning framework.
LGAug 18, 2025
Robust Federated Learning under Adversarial Attacks via Loss-Based Client ClusteringEmmanouil Kritharakis, Dusan Jakovetic, Antonios Makris et al.
Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
MLJul 12, 2025
Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via SymmetrizationAleksandar Armacki, Dragana Bajovic, Dusan Jakovetic et al.
We study convergence in high-probability of SGD-type methods in non-convex optimization and the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$, for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators, based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and PDF satisfying a mild technical condition, with N-MSGE additionally requiring bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
LGOct 21, 2024
Large Deviation Upper Bounds and Improved MSE Rates of Nonlinear SGD: Heavy-tailed Noise and Power of SymmetryAleksandar Armacki, Shuhua Yu, Dragana Bajovic et al.
We study large deviation upper bounds and mean-squared error (MSE) guarantees of a general framework of nonlinear stochastic gradient methods in the online setting, in the presence of heavy-tailed noise. Unlike existing works that rely on the closed form of a nonlinearity (typically clipping), our framework treats the nonlinearity in a black-box manner, allowing us to provide unified guarantees for a broad class of bounded nonlinearities, including many popular ones, like sign, quantization, normalization, as well as component-wise and joint clipping. We provide several strong results for a broad range of step-sizes in the presence of heavy-tailed noise with symmetric probability density function, positive in a neighbourhood of zero and potentially unbounded moments. In particular, for non-convex costs we provide a large deviation upper bound for the minimum norm-squared of gradients, showing an asymptotic tail decay on an exponential scale, at a rate $\sqrt{t} / \log(t)$. We establish the accompanying rate function, showing an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. Next, for non-convex costs and the minimum norm-squared of gradients, we derive the optimal MSE rate $\widetilde{\mathcal{O}}(t^{-1/2})$. Moreover, for strongly convex costs and the last iterate, we provide an MSE rate that can be made arbitrarily close to the optimal rate $\mathcal{O}(t^{-1})$, improving on the state-of-the-art results in the presence of heavy-tailed noise. Finally, we establish almost sure convergence of the minimum norm-squared of gradients, providing an explicit rate, which can be made arbitrarily close to $o(t^{-1/4})$.
OCMay 30, 2025
Distributed gradient methods under heavy-tailed communication noiseManojlo Vukovic, Dusan Jakovetic, Dragana Bajovic et al.
We consider a standard distributed optimization problem in which networked nodes collaboratively minimize the sum of their locally known convex costs. For this setting, we address for the first time the fundamental problem of design and analysis of distributed methods to solve the above problem when inter-node communication is subject to \emph{heavy-tailed} noise. Heavy-tailed noise is highly relevant and frequently arises in densely deployed wireless sensor and Internet of Things (IoT) networks. Specifically, we design a distributed gradient-type method that features a carefully balanced mixed time-scale time-varying consensus and gradient contribution step sizes and a bounded nonlinear operator on the consensus update to limit the effect of heavy-tailed noise. Assuming heterogeneous strongly convex local costs with mutually different minimizers that are arbitrarily far apart, we show that the proposed method converges to a neighborhood of the network-wide problem solution in the mean squared error (MSE) sense, and we also characterize the corresponding convergence rate. We further show that the asymptotic MSE can be made arbitrarily small through consensus step-size tuning, possibly at the cost of slowing down the transient error decay. Numerical experiments corroborate our findings and demonstrate the resilience of the proposed method to heavy-tailed (and infinite variance) communication noise. They also show that existing distributed methods, designed for finite-communication-noise-variance settings, fail in the presence of infinite variance noise.
LGFeb 1, 2022
Gradient Based ClusteringAleksandar Armacki, Dragana Bajovic, Dusan Jakovetic et al.
We propose a general approach for distance based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.
LGFeb 1, 2022
Personalized Federated Learning via Convex ClusteringAleksandar Armacki, Dragana Bajovic, Dusan Jakovetic et al.
We propose a parametric family of algorithms for personalized federated learning with locally convex user costs. The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized via a sum-of-norms penalty, weighted by a penalty parameter $λ$. The proposed approach enables "automatic" model clustering, without prior knowledge of the hidden cluster structure, nor the number of clusters. Analytical bounds on the weight parameter, that lead to simultaneous personalization, generalization and automatic model clustering are provided. The solution to the formulated problem enables personalization, by providing different models across different clusters, and generalization, by providing models different than the per-user models computed in isolation. We then provide an efficient algorithm based on the Parallel Direction Method of Multipliers (PDMM) to solve the proposed formulation in a federated server-users setting. Numerical experiments corroborate our findings. As an interesting byproduct, our results provide several generalizations to convex clustering.
LGJan 21, 2019
Distributed Nesterov gradient methods over arbitrary graphsRan Xin, Dusan Jakovetic, Usman A. Khan
In this letter, we introduce a distributed Nesterov method, termed as $\mathcal{ABN}$, that does not require doubly-stochastic weight matrices. Instead, the implementation is based on a simultaneous application of both row- and column-stochastic weights that makes this method applicable to arbitrary (strongly-connected) graphs. Since constructing column-stochastic weights needs additional information (the number of outgoing neighbors at each agent), not available in certain communication protocols, we derive a variation, termed as FROZEN, that only requires row-stochastic weights but at the expense of additional iterations for eigenvector learning. We numerically study these algorithms for various objective functions and network parameters and show that the proposed distributed Nesterov methods achieve acceleration compared to the current state-of-the-art methods for distributed optimization.