ML | Jan 27, 2023
Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions
Anant Raj, Lingjiong Zhu, Mert Gürbüzbalaban et al.
Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light on the empirical observations, thanks to the generality of the loss functions.
ML | May 13, 2022
Heavy-Tail Phenomenon in Decentralized SGD
Mert Gurbuzbalaban, Yuanhan Hu, Umut Simsekli et al.
Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
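To make the role of multiplicative noise concrete, here is a minimal sketch for the simplest centralized case only (not the decentralized setting of the paper). It assumes scalar linear regression with standard Gaussian inputs and batch size 1, so the SGD recursion has the form x_{k+1} = (1 - eta * a_k^2) x_k + (additive term). Under the classical Kesten-type condition for such recursions, the tail index alpha solves E|1 - eta * a^2|^alpha = 1. The function name `tail_index`, the Monte Carlo sample size, and the two stepsizes are illustrative choices, not taken from the paper.

```python
import random


def tail_index(eta, samples, tol=1e-6):
    """Solve mean(|1 - eta * a^2|^alpha) = 1 for alpha > 0 by bisection.

    Assumes the recursion is stable (the empirical mean of log|1 - eta * a^2|
    is negative), so the moment is below 1 for small alpha and above 1 later.
    """

    def moment(alpha):
        return sum(abs(1.0 - eta * a * a) ** alpha for a in samples) / len(samples)

    lo, hi = 1e-3, 1.0
    while moment(hi) < 1.0:      # grow the bracket until the moment exceeds 1
        hi *= 2.0
        if hi > 64.0:
            raise ValueError("no root found below alpha = 64")
    while hi - lo > tol:         # standard bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if moment(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
a_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # Gaussian inputs
alpha_small_step = tail_index(0.3, a_samples)
alpha_large_step = tail_index(0.8, a_samples)
```

In this toy setting the larger stepsize yields the smaller estimated tail index, that is, heavier tails. The network-dependent estimates described in the abstract are not reproduced here.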
ML | Jun 2, 2022
Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares
Anant Raj, Melih Barsbey, Mert Gürbüzbalaban et al.
Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a "threshold of heavy-tailedness" such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
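As a rough illustration of the kind of proxy model described above, the following sketch simulates an Euler discretization of a one-dimensional SDE dX = -a X dt + dL, where L is a symmetric alpha-stable Lévy process. The drift coefficient `a`, the stepsize `eta`, and the function names are assumptions made for illustration and are not the paper's exact setup. The stable draws use the standard Chambers-Mallows-Stuck formula (valid for alpha != 1), and the increment over a step of size `eta` is scaled by `eta ** (1 / alpha)`.

```python
import math
import random


def sym_alpha_stable(alpha, rng):
    """One draw from a symmetric alpha-stable law (Chambers-Mallows-Stuck, alpha != 1)."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def euler_heavy_tailed_ou(alpha, a=1.0, eta=0.01, n_steps=5000, seed=0):
    """Euler scheme for dX = -a X dt + dL^alpha, started at 0 (hypothetical toy model)."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n_steps):
        noise = eta ** (1.0 / alpha) * sym_alpha_stable(alpha, rng)
        x = x - eta * a * x + noise
        path.append(x)
    return path


path = euler_heavy_tailed_ou(alpha=1.5)

# Sanity check of the sampler: alpha = 2 recovers a Gaussian with variance 2.
check_rng = random.Random(1)
gauss_draws = [sym_alpha_stable(2.0, check_rng) for _ in range(20000)]
gauss_var = sum(d * d for d in gauss_draws) / len(gauss_draws)
```

For alpha < 2 the stable noise has infinite variance but finite moments of order p < alpha, which is one way to see why a surrogate loss $|x|^p$ with $p<2$ can behave better than the squared loss in this model.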
ML | Nov 29, 2022
Penalized Overdamped and Underdamped Langevin Monte Carlo Algorithms for Constrained Sampling
Mert Gürbüzbalaban, Yuanhan Hu, Lingjiong Zhu
We consider the constrained sampling problem where the goal is to sample from a target distribution $π(x)\propto e^{-f(x)}$ when $x$ is constrained to lie on a convex body $\mathcal{C}$. Motivated by penalty methods from continuous optimization, we propose penalized Langevin Dynamics (PLD) and penalized underdamped Langevin Monte Carlo (PULMC) methods that convert the constrained sampling problem into an unconstrained sampling problem by introducing a penalty function for constraint violations. When $f$ is smooth and gradients are available, we get $\tilde{\mathcal{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error where the error is measured in the TV distance and $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors. For PULMC, we improve the result to $\tilde{\mathcal{O}}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $\mathcal{C}$ is sufficiently smooth. To our knowledge, these are the first convergence results for underdamped Langevin Monte Carlo methods in constrained sampling that handle non-convex $f$ and provide guarantees with the best dimension dependency among existing methods with deterministic gradient. If unbiased stochastic estimates of the gradient of $f$ are available, we propose PSGLD and PSGULMC methods that can handle stochastic gradients and are scalable to large datasets without requiring Metropolis-Hastings correction steps. For PSGLD and PSGULMC, when $f$ is strongly convex and smooth, we obtain $\tilde{\mathcal{O}}(d/\varepsilon^{18})$ and $\tilde{\mathcal{O}}(d\sqrt{d}/\varepsilon^{39})$ iteration complexity in W2 distance. When $f$ is smooth and can be non-convex, we provide finite-time performance bounds and iteration complexity results. Finally, we illustrate the performance on Bayesian LASSO regression and Bayesian constrained deep learning problems.
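The penalty idea can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or tuning: the target potential is f(x) = x^2 / 2, the constraint set is C = [-1, 1], and the penalty is assumed to be the squared distance to C divided by 2 * delta, whose gradient is (x - P_C(x)) / delta with P_C the projection onto C. The names `penalized_langevin` and `project_interval`, and the values of `eta` and `delta`, are hypothetical.

```python
import math
import random


def project_interval(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval C = [lo, hi]."""
    return min(max(x, lo), hi)


def penalized_langevin(n_steps=20000, eta=1e-3, delta=1e-2, seed=0):
    """Unadjusted Langevin steps on f(x) + dist(x, C)^2 / (2 * delta)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        grad_f = x                                       # f(x) = x^2 / 2
        grad_penalty = (x - project_interval(x)) / delta  # zero inside C
        x = x - eta * (grad_f + grad_penalty) + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


samples = penalized_langevin()
inside = sum(1 for s in samples if abs(s) <= 1.5) / len(samples)
```

The chain moves freely (no projection or reflection step), and the penalty pulls it back whenever it leaves C. A smaller `delta` enforces the constraint more tightly but forces a smaller stepsize.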
NA | May 9, 2017
Asymptotic Normality of Extensible Grid Sampling
Zhijian He, Lingjiong Zhu
Recently, He and Owen (2016) proposed the use of Hilbert's space filling curve (HSFC) in numerical integration as a way of reducing the dimension from $d>1$ to $d=1$. This paper studies the asymptotic normality of the HSFC-based estimate when using a scrambled van der Corput sequence as input. We show that the estimate has an asymptotic normal distribution for functions in $C^1([0,1]^d)$, excluding the trivial case of constant functions. The asymptotic normality also holds for discontinuous functions under mild conditions. It was previously known only that scrambled $(0,m,d)$-net quadratures enjoy the asymptotic normality for smooth enough functions, whose mixed partial gradients satisfy a Hölder condition. As a by-product, we find lower bounds for the variance of the HSFC-based estimate. Particularly, for nontrivial functions in $C^1([0,1]^d)$, the lower bound is of order $n^{-1-2/d}$, which matches the rate of the upper bound established in He and Owen (2016).
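For readers unfamiliar with the input sequence, the sketch below generates the plain (unscrambled) van der Corput sequence in base 2 via the radical inverse. The scrambling step and the Hilbert curve mapping from [0,1] to [0,1]^d used in the paper are deliberately omitted; the function name is an illustrative choice.

```python
def van_der_corput(n, base=2):
    """Radical inverse of the non-negative integer n in the given base."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)   # peel off the least significant digit
        denom *= base
        x += digit / denom           # mirror it across the radix point
    return x


points = [van_der_corput(i) for i in range(8)]
# points == [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

Each new point falls into the largest remaining gap, which is what makes the sequence extensible: the first n points are well spread for every n, not only for powers of the base.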
ML | Feb 10, 2023
Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize
Mert Gürbüzbalaban, Yuanhan Hu, Umut Şimşekli et al.
Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize schedule. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.
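The claim that Markovian stepsizes contain the other schedules as special cases can be illustrated with a small generator. This is a sketch with hypothetical names and values: the stepsize at each iteration is the value attached to the current state of a finite Markov chain with a user-supplied transition matrix. A cyclic permutation matrix gives a cyclic schedule, identical rows give i.i.d. draws (after the fixed starting state), and a single state gives a constant stepsize.

```python
import random


def markov_stepsizes(values, transition, n, seed=0, start=0):
    """Emit n stepsizes driven by a finite-state Markov chain.

    values[i] is the stepsize used in state i; transition[i] holds the
    probabilities of moving from state i to each state.
    """
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(n):
        out.append(values[state])
        state = rng.choices(range(len(values)), weights=transition[state])[0]
    return out


# Cyclic schedule: deterministic transitions 0 -> 1 -> 2 -> 0.
cyclic = markov_stepsizes([0.1, 0.2, 0.3], [[0, 1, 0], [0, 0, 1], [1, 0, 0]], 6)
# Constant schedule: a single state.
constant = markov_stepsizes([0.1], [[1]], 4)
# i.i.d. schedule: every row of the transition matrix is the same distribution.
iid = markov_stepsizes([0.1, 0.3], [[0.5, 0.5], [0.5, 0.5]], 1000)
```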
LG | Nov 18, 2023
Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models
Xuefeng Gao, Hoang M. Nguyen, Lingjiong Zhu
Score-based generative models (SGMs) are a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment with SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.
LG | Dec 12, 2024 | Code
Go With the Flow: Fast Diffusion for Gaussian Mixture Models
George Rapakoulias, Ali Reza Pedram, Fengjiao Liu et al.
Schrödinger Bridges (SBs) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. The proposed method generalizes naturally to more general classes of dynamical systems, such as controllable linear time-varying systems, enabling efficient solutions to multi-marginal momentum SBs between GMMs, a challenging distribution interpolation problem. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, learning of cellular dynamics using multi-marginal momentum SBs, and various other examples. The implementation is publicly available at https://github.com/georgeRapa/GMMflow.
LG | May 8
Stochastic Transition-Map Distillation for Fast Probabilistic Inference
George Rapakoulias, Peter Garud, Lingjiong Zhu et al.
Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.
ML | Jan 31, 2024
Convergence Analysis for General Probability Flow ODEs of Diffusion Models in Wasserstein Distances
Xuefeng Gao, Lingjiong Zhu
Score-based generative modeling with probability flow ordinary differential equations (ODEs) has achieved remarkable success in a variety of applications. While various fast ODE-based samplers have been proposed in the literature and employed in practice, the theoretical understanding of the convergence properties of the probability flow ODE is still quite limited. In this paper, we provide the first non-asymptotic convergence analysis for a general class of probability flow ODE samplers in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distributions. We then consider various examples and establish results on the iteration complexity of the corresponding ODE-based samplers. Our proof technique relies on spelling out explicitly the contraction rate for the continuous-time ODE and analyzing the discretization and score-matching errors using synchronous coupling; the challenge in our analysis mainly arises from the inherent non-autonomy of the probability flow ODE and the specific exponential integrator that we study.
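A probability flow ODE can be worked through in closed form for a toy case. The sketch below assumes a one-dimensional Gaussian data distribution N(0, 4) and the forward process dx = -x dt + sqrt(2) dW, so the marginal at time t is N(0, v(t)) with v(t) = 1 + 3 e^{-2t} and the exact score is -x / v(t). The probability flow ODE is dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t) with f = -x and g^2 = 2. It is integrated backward from T to 0 with a plain explicit Euler scheme, which is a simpler stand-in for the exponential integrator analyzed in the paper. All names and constants here are illustrative assumptions.

```python
import math

SIGMA0_SQ = 4.0   # assumed variance of the Gaussian data distribution


def variance(t):
    """Variance of the forward marginal at time t for dx = -x dt + sqrt(2) dW."""
    return SIGMA0_SQ * math.exp(-2.0 * t) + 1.0 - math.exp(-2.0 * t)


def score(x, t):
    """Exact score of the Gaussian marginal N(0, variance(t))."""
    return -x / variance(t)


def probability_flow_backward(x_T, T=3.0, n_steps=3000):
    """Integrate the probability flow ODE from time T back to time 0 (explicit Euler)."""
    h = T / n_steps
    x, t = x_T, T
    for _ in range(n_steps):
        drift = -x - score(x, t)   # f - 0.5 * g^2 * score, with f = -x and g^2 = 2
        x -= h * drift             # step backward in time
        t -= h
    return x


x0 = probability_flow_backward(1.0)
exact = math.sqrt(variance(0.0) / variance(3.0))   # exact flow map applied to x_T = 1
```

Because the ODE is linear in this Gaussian case, the exact flow simply rescales x_T by sqrt(v(0) / v(T)), which gives a direct check on the discretization error.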
ML | May 1
Decentralized Proximal Stochastic Gradient Langevin Dynamics
Mohammad Rafiqul Islam, Lingjiong Zhu
We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampling from a log-concave probability distribution constrained to a convex domain. Constraints are enforced through a shared proximal regularization based on the Moreau-Yosida envelope, enabling unconstrained updates while preserving consistency with the target constrained posterior. We establish non-asymptotic convergence guarantees in the 2-Wasserstein distance for both individual agent iterates and their network averages. Our analysis shows that DE-PSGLD converges to a regularized Gibbs distribution and quantifies the bias introduced by the proximal approximation. We evaluate DE-PSGLD for different sampling problems on synthetic and real datasets. As the first decentralized approach for constrained domains, our algorithm exhibits fast posterior concentration and high predictive accuracy.
LG | Dec 2, 2024
Generalized EXTRA stochastic gradient Langevin dynamics
Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang et al.
Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches of large datasets and are scalable to large datasets. However, when data is decentralized across a network of agents subject to communication and privacy constraints, standard SGLD algorithms cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms, where Bayesian learning is performed collaboratively by a network of agents without sharing individual data. Nonetheless, existing DE-SGLD algorithms induce a bias at every agent that can negatively impact performance; this bias persists even when using full batches and is attributable to network effects. Motivated by the EXTRA algorithm and its generalizations for decentralized optimization, we propose the generalized EXTRA stochastic gradient Langevin dynamics, which eliminates this bias in the full-batch setting. Moreover, we show that, in the mini-batch setting, our algorithm provides performance bounds that significantly improve upon those of standard DE-SGLD algorithms in the literature. Our numerical results also demonstrate the efficiency of the proposed approach.
ML | Mar 4, 2024
Privacy of SGD under Gaussian or Heavy-Tailed Noise: Guarantees without Gradient Clipping
Umut Şimşekli, Mert Gürbüzbalaban, Sinan Yıldırım et al.
The injection of heavy-tailed noise into the iterates of stochastic gradient descent (SGD) has garnered growing interest in recent years due to its theoretical and empirical benefits for optimization and generalization. However, its implications for privacy preservation remain largely unexplored. Aiming to bridge this gap, we provide differential privacy (DP) guarantees for noisy SGD, when the injected noise follows an $α$-stable distribution, which includes a spectrum of heavy-tailed distributions (with infinite variance) as well as the light-tailed Gaussian distribution. Considering the $(ε, δ)$-DP framework, we show that SGD with heavy-tailed perturbations achieves $(0, O(1/n))$-DP for a broad class of loss functions which can be non-convex, where $n$ is the number of data points. As a remarkable byproduct, contrary to prior work that necessitates bounded sensitivity for the gradients or clipping the iterates, our theory can handle unbounded gradients without clipping, and reveals that under mild assumptions, such a projection step is not actually necessary. Our results suggest that, given other benefits of heavy-tails in optimization, heavy-tailed noising schemes can be a viable alternative to their light-tailed counterparts.
CV | Feb 13, 2024
Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models
Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu et al.
Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99\% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
LG | Jan 20, 2025
Non-Reversible Langevin Algorithms for Constrained Sampling
Hengrong Du, Qi Feng, Changwei Tu et al.
We consider the constrained sampling problem where the goal is to sample from a target distribution on a constrained domain. We propose skew-reflected non-reversible Langevin dynamics (SRNLD), a continuous-time stochastic differential equation with skew-reflected boundary. We obtain non-asymptotic convergence rate of SRNLD to the target distribution in both total variation and 1-Wasserstein distances. By breaking reversibility, we show that the convergence is faster than the special case of the reversible dynamics. Based on the discretization of SRNLD, we propose skew-reflected non-reversible Langevin Monte Carlo (SRNLMC), and obtain non-asymptotic discretization error from SRNLD, and convergence guarantees to the target distribution in 1-Wasserstein distance. We show better performance guarantees than the projected Langevin Monte Carlo in the literature that is based on the reversible dynamics. Numerical experiments are provided for both synthetic and real datasets to show efficiency of the proposed algorithms.
ML | Feb 2, 2025
Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise
Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet et al.
Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, the existing theoretical results mainly focus on stochastic gradient descent (SGD) and the analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable observation: for quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete-time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
ML | Nov 19, 2025
Rényi Differential Privacy for Heavy-Tailed SDEs via Fractional Poincaré Inequalities
Benjamin Dupuis, Mert Gürbüzbalaban, Umut Şimşekli et al.
Characterizing the differential privacy (DP) of learning algorithms has become a major challenge in recent years. In parallel, many studies suggested investigating the behavior of stochastic gradient descent (SGD) with heavy-tailed noise, both as a model for modern deep learning models and to improve their performance. However, most DP bounds focus on light-tailed noise, where satisfactory guarantees have been obtained but the proposed techniques do not directly extend to the heavy-tailed setting. Recently, the first DP guarantees for heavy-tailed SGD were obtained. These results provide $(0,δ)$-DP guarantees without requiring gradient clipping. Despite casting new light on the link between DP and heavy-tailed algorithms, these results have a strong dependence on the number of parameters and cannot be extended to other DP notions like the well-established Rényi differential privacy (RDP). In this work, we propose to address these limitations by deriving the first RDP guarantees for heavy-tailed SDEs, as well as their discretized counterparts. Our framework is based on new Rényi flow computations and the use of well-established fractional Poincaré inequalities. Under the assumption that such inequalities are satisfied, we obtain DP guarantees that have a much weaker dependence on the dimension compared to prior art.
OC | Nov 16, 2025
DIGing-SGLD: Decentralized and Scalable Langevin Sampling over Time-Varying Networks
Waheed U. Bajwa, Mert Gurbuzbalaban, Mustafa Ali Kutbay et al.
Sampling from a target distribution induced by training data is central to Bayesian learning, with Stochastic Gradient Langevin Dynamics (SGLD) serving as a key tool for scalable posterior sampling and decentralized variants enabling learning when data are distributed across a network of agents. This paper introduces DIGing-SGLD, a decentralized SGLD algorithm designed for scalable Bayesian learning in multi-agent systems operating over time-varying networks. Existing decentralized SGLD methods are restricted to static network topologies, and many exhibit steady-state sampling bias caused by network effects, even when full batches are used. DIGing-SGLD overcomes these limitations by integrating Langevin-based sampling with the gradient-tracking mechanism of the DIGing algorithm, originally developed for decentralized optimization over time-varying networks, thereby enabling efficient and bias-free sampling without a central coordinator. To our knowledge, we provide the first finite-time non-asymptotic Wasserstein convergence guarantees for decentralized SGLD-based sampling over time-varying networks, with explicit constants. Under standard strong convexity and smoothness assumptions, DIGing-SGLD achieves geometric convergence to an $O(\sqrt{η})$ neighborhood of the target distribution, where $η$ is the stepsize, with dependence on the target accuracy matching the best-known rates for centralized and static-network SGLD algorithms using constant stepsize. Numerical experiments on Bayesian linear and logistic regression validate the theoretical results and demonstrate the strong empirical performance of DIGing-SGLD under dynamically evolving network conditions.
ML | Sep 23, 2025
Anchored Langevin Algorithms
Mert Gurbuzbalaban, Hoang M. Nguyen, Xicheng Zhang et al.
Standard first-order Langevin algorithms such as the unadjusted Langevin algorithm (ULA) are obtained by discretizing the Langevin diffusion and are widely used for sampling in machine learning because they scale to high dimensions and large datasets. However, they face two key limitations: (i) they require differentiable log-densities, excluding targets with non-differentiable components; and (ii) they generally fail to sample heavy-tailed targets. We propose anchored Langevin dynamics, a unified approach that accommodates non-differentiable targets and certain classes of heavy-tailed distributions. The method replaces the original potential with a smooth reference potential and modifies the Langevin diffusion via multiplicative scaling. We establish non-asymptotic guarantees in the 2-Wasserstein distance to the target distribution and provide an equivalent formulation derived via a random time change of the Langevin diffusion. We provide numerical experiments to illustrate the theory and practical performance of our proposed approach.
CO | Aug 31, 2025
Regime-Switching Langevin Monte Carlo Algorithms
Xiaoyu Wang, Yingli Wang, Lingjiong Zhu
Langevin Monte Carlo (LMC) algorithms are popular Markov Chain Monte Carlo (MCMC) methods to sample a target probability distribution, which arises in many applications in machine learning. Inspired by regime-switching stochastic differential equations in the probability literature, we propose and study regime-switching Langevin dynamics (RS-LD) and regime-switching kinetic Langevin dynamics (RS-KLD). Based on their discretizations, we introduce regime-switching Langevin Monte Carlo (RS-LMC) and regime-switching kinetic Langevin Monte Carlo (RS-KLMC) algorithms, which can also be viewed as LMC and KLMC algorithms with random stepsizes. We also propose frictional-regime-switching kinetic Langevin dynamics (FRS-KLD) and its associated algorithm frictional-regime-switching kinetic Langevin Monte Carlo (FRS-KLMC), which can also be viewed as the KLMC algorithm with random frictional coefficients. We provide their 2-Wasserstein non-asymptotic convergence guarantees to the target distribution, and analyze the iteration complexities. Numerical experiments using both synthetic and real data are provided to illustrate the efficiency of our proposed algorithms.
ML | Aug 24, 2025
High-Order Langevin Monte Carlo Algorithms
Thanh Dang, Mert Gurbuzbalaban, Mohammad Rafiqul Islam et al.
Langevin algorithms are popular Markov chain Monte Carlo (MCMC) methods for large-scale sampling problems that often arise in data science. We propose Monte Carlo algorithms based on the discretizations of $P$-th order Langevin dynamics for any $P\geq 3$. Our design of $P$-th order Langevin Monte Carlo (LMC) algorithms combines splitting and accurate integration methods. We obtain Wasserstein convergence guarantees for sampling from distributions with log-concave and smooth densities. Specifically, the mixing time of the $P$-th order LMC algorithm scales as $O\left(d^{\frac{1}{R}}/ε^{\frac{1}{2R}}\right)$ for $R=4\cdot 1_{\{ P=3\}}+ (2P-1)\cdot 1_{\{ P\geq 4\}}$, which has a better dependence on the dimension $d$ and the accuracy level $ε$ as $P$ grows. Numerical experiments illustrate the efficiency of our proposed algorithms.
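The stated rate is easy to tabulate. The short sketch below just evaluates the formula from the abstract, R = 4 when P = 3 and R = 2P - 1 when P >= 4, and returns the resulting exponents 1/R (on the dimension d) and 1/(2R) (on 1/ε). The function names are illustrative.

```python
from fractions import Fraction


def rate_index(P):
    """R from the abstract: 4 if P == 3, and 2P - 1 if P >= 4."""
    if P < 3:
        raise ValueError("the stated rate covers P >= 3 only")
    return 4 if P == 3 else 2 * P - 1


def mixing_exponents(P):
    """Exponents (on d, on 1/eps) in the mixing time O(d^(1/R) / eps^(1/(2R)))."""
    R = rate_index(P)
    return Fraction(1, R), Fraction(1, 2 * R)


exponents = {P: mixing_exponents(P) for P in (3, 4, 5)}
```

For P = 3, 4, 5 this gives dimension exponents 1/4, 1/7, 1/9 and accuracy exponents 1/8, 1/14, 1/18, so both dependencies weaken as the order P grows.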
ML | Jun 9, 2025
Accelerating Constrained Sampling: A Large Deviations Approach
Yingli Wang, Changwei Tu, Xiaoyu Wang et al.
The problem of sampling a target probability distribution on a constrained domain arises in many applications including machine learning. For constrained sampling, various Langevin algorithms such as projected Langevin Monte Carlo (PLMC) based on the discretization of reflected Langevin dynamics (RLD) and more generally skew-reflected non-reversible Langevin Monte Carlo (SRNLMC) based on the discretization of skew-reflected non-reversible Langevin dynamics (SRNLD) have been proposed and studied in the literature. This work focuses on the long-time behavior of SRNLD, where a skew-symmetric matrix is added to RLD. Although acceleration for SRNLD has been studied, it is not clear how one should design the skew-symmetric matrix in the dynamics to achieve good performance in practice. We establish a large deviation principle (LDP) for the empirical measure of SRNLD when the skew-symmetric matrix is chosen such that its product with the inward unit normal vector field on the boundary is zero. By explicitly characterizing the rate functions, we show that this choice of the skew-symmetric matrix accelerates the convergence to the target distribution compared to RLD and reduces the asymptotic variance. Numerical experiments for SRNLMC based on the proposed skew-symmetric matrix show superior performance, which validates the theoretical findings from the large deviations theory.
LG | Feb 4, 2025
BRIDLE: Generalized Self-supervised Learning with Quantization
Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang et al.
Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
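The difference between a single codebook and residual quantization can be seen in a tiny example. The sketch below implements generic residual quantization with fixed, hand-written codebooks: each stage picks the nearest code vector to the residual left by the previous stages. It is not the BRIDLE tokenizer (no learning, no encoder), and the codebooks, input vector, and function names are made up for illustration.

```python
def nearest(codebook, v):
    """Index of the code vector closest to v in squared Euclidean distance."""
    def sqdist(c):
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    recon = [0.0] * len(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, recon


coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]       # stage 1: coarse codes
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]     # stage 2: corrections
codes, recon = residual_quantize([1.2, 0.8], [coarse, fine])
# codes == [1, 1] and recon == [1.25, 0.75]
```

With the coarse codebook alone the reconstruction would be [1.0, 1.0]; the second stage refines it to [1.25, 0.75], which is the fine-grained discretization effect the abstract attributes to hierarchical codebooks.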
ML May 20, 2023
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Lingjiong Zhu, Mert Gurbuzbalaban, Anant Raj et al.
Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied to different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds -- which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, are novel.
ML Jun 9, 2021
Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms
Alexander Camuto, George Deligiannidis, Murat A. Erdogdu et al.
Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the `complexity' of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent -- SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden-layered neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks.
OC Feb 20, 2021
Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
Hongjian Wang, Mert Gürbüzbalaban, Lingjiong Zhu et al.
Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the second-order moments. In this paper, we provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance, for a class of strongly convex objectives. In the case where the $p$-th moment of the noise exists for some $p\in [1,2)$, we first identify a condition on the Hessian, coined '$p$-positive (semi-)definiteness', that leads to an interesting interpolation between positive semi-definite matrices ($p=2$) and diagonally dominant matrices with non-negative diagonal entries ($p=1$). Under this condition, we then provide a convergence rate for the distance to the global optimum in $L^p$. Furthermore, we provide a generalized central limit theorem, which shows that the properly scaled Polyak-Ruppert averaging converges weakly to a multivariate $α$-stable random vector. Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without necessitating any modification to either the loss function or the algorithm itself, as typically required in robust statistics. We demonstrate the implications of our results for applications such as linear regression and generalized linear models subject to heavy-tailed data.
ML Feb 13, 2021
Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections
Alexander Camuto, Xiaoyu Wang, Lingjiong Zhu et al.
Gaussian noise injections (GNIs) are a family of simple and widely-used regularisation methods for training neural networks, where one injects additive or multiplicative Gaussian noise to the network activations at every iteration of the optimisation algorithm, which is typically chosen as stochastic gradient descent (SGD). In this paper we focus on the so-called `implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of SGD. We show that this effect induces an asymmetric heavy-tailed noise on SGD gradient updates. In order to model this modified dynamics, we first develop a Langevin-like stochastic differential equation that is driven by a general family of asymmetric heavy-tailed noise. Using this model we then formally prove that GNIs induce an `implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry. Our empirical results confirm that different types of neural networks trained with GNIs are well-modelled by the proposed dynamics and that the implicit effect of these injections induces a bias that degrades the performance of networks.
ML Jul 1, 2020
Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo
Mert Gürbüzbalaban, Xuefeng Gao, Yuanhan Hu et al.
Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) are two popular Markov Chain Monte Carlo (MCMC) algorithms for Bayesian inference that can scale to large datasets, allowing one to sample from the posterior distribution of the parameters of a statistical model given the input data and the prior distribution over the model parameters. However, these algorithms do not apply to the decentralized learning setting, where a network of agents is working collaboratively to learn the parameters of a statistical model without sharing their individual data due to privacy reasons or communication constraints. We study two algorithms: Decentralized SGLD (DE-SGLD) and Decentralized SGHMC (DE-SGHMC), which are adaptations of SGLD and SGHMC methods that allow scalable Bayesian inference in the decentralized setting for large datasets. We show that when the posterior distribution is strongly log-concave and smooth, the iterates of these algorithms converge linearly to a neighborhood of the target distribution in the 2-Wasserstein distance if their parameters are selected appropriately. We illustrate the efficiency of our algorithms on decentralized Bayesian linear regression and Bayesian logistic regression problems.
OC Jun 8, 2020
The Heavy-Tail Phenomenon in SGD
Mert Gurbuzbalaban, Umut Şimşekli, Lingjiong Zhu
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $η$ to the batch-size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the `tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $η$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data, fully connected, and convolutional neural networks.
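As a rough illustration of the linear regression claim above, consider a one-dimensional toy model that is an assumption of this sketch, not the paper's setup: SGD with batch size 1 on the loss $(a x - y)^2/2$ with Gaussian input $a$, which gives the recursion $x_{k+1} = (1 - η a_k^2) x_k + η a_k y_k$. Under the standard Kesten-Goldie conditions for such random recursions, the stationary tail index is the positive $α$ solving $E|1 - η a^2|^α = 1$. The function name `tail_index` and all parameter values are hypothetical; the expectation is replaced by a seeded Monte Carlo average.

```python
import random


def tail_index(eta, sigma=1.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of the alpha > 0 solving E|1 - eta * a^2|^alpha = 1,
    where a ~ N(0, sigma^2). Toy 1-D model, batch size 1 (an assumption)."""
    rng = random.Random(seed)
    # Sampled multipliers |1 - eta * a^2| of the linear recursion.
    mults = [abs(1.0 - eta * rng.gauss(0.0, sigma) ** 2) for _ in range(n_samples)]

    def h(alpha):
        # Empirical version of E|1 - eta * a^2|^alpha - 1.
        return sum(m ** alpha for m in mults) / n_samples - 1.0

    lo, hi = 1e-3, 1.0
    if h(lo) >= 0.0:
        raise ValueError("recursion is not contractive on average")
    # Grow the upper bracket until the empirical moment exceeds 1.
    while h(hi) < 0.0:
        hi *= 2.0
        if hi > 64.0:
            raise ValueError("no root found: samples never expand the iterate")
    # Bisection on the bracket [lo, hi]; h is convex with h(0) = 0.
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for eta in (0.4, 0.8):
        print(f"eta = {eta}: estimated tail index = {tail_index(eta):.2f}")
```

In this toy model with $σ = 1$ the second moment of the multiplier is $1 - 2η + 3η^2$, so the estimated tail index should fall below 2 (infinite variance) once $η$ exceeds $2/3$, even though the Gaussian data has finite moments of all orders.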
OC Apr 6, 2020
Non-Convex Optimization via Non-Reversible Stochastic Gradient Langevin Dynamics
Yuanhan Hu, Xiaoyu Wang, Xuefeng Gao et al.
Stochastic Gradient Langevin Dynamics (SGLD) is a powerful algorithm for optimizing a non-convex objective, where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. SGLD is based on the overdamped Langevin diffusion which is reversible in time. By adding an anti-symmetric matrix to the drift term of the overdamped Langevin diffusion, one gets a non-reversible diffusion that converges to the same stationary distribution with a faster convergence rate. In this paper, we study the non-reversible Stochastic Gradient Langevin Dynamics (NSGLD) which is based on discretization of the non-reversible Langevin diffusion. We provide finite-time performance bounds for the global convergence of NSGLD for solving stochastic non-convex optimization problems. Our results lead to non-asymptotic guarantees for both population and empirical risk minimization problems. Numerical experiments for Bayesian independent component analysis and neural network models show that NSGLD can outperform SGLD with proper choices of the anti-symmetric matrix.
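The effect of the anti-symmetric matrix can be checked by hand on a two-dimensional quadratic. The sketch below is an illustration under stated assumptions, not the paper's NSGLD algorithm: it takes $f(x) = \frac12 x^\top H x$ with $H = \mathrm{diag}(h_1, h_2)$, unit temperature, and $J = [[0, a], [-a, 0]]$, so the continuous-time diffusion is $dX = -(I + J) H X\,dt + \sqrt{2}\,dW$. It computes the slowest decay rate of the drift (the smallest real part of the eigenvalues of $(I+J)H$) and verifies that $H^{-1}$ still solves the stationary covariance equation $AΣ + ΣA^\top = 2I$. All function names are hypothetical.

```python
import cmath


def drift_matrix(h1, h2, a):
    """A = (I + J) H for H = diag(h1, h2) and J = [[0, a], [-a, 0]]."""
    return [[h1, a * h2], [-a * h1, h2]]


def slowest_decay_rate(A):
    """Smallest real part among the two eigenvalues of a 2x2 matrix A."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return min(((tr + root) / 2).real, ((tr - root) / 2).real)


def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def lyapunov_residual(A, h1, h2):
    """Largest absolute entry of A S + S A^T - 2 I with S = H^{-1}."""
    S = [[1.0 / h1, 0.0], [0.0, 1.0 / h2]]
    At = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
    AS, SAt = matmul(A, S), matmul(S, At)
    return max(abs(AS[i][j] + SAt[i][j] - (2.0 if i == j else 0.0))
               for i in range(2) for j in range(2))


if __name__ == "__main__":
    h1, h2 = 0.1, 1.0
    for a in (0.0, 2.0):
        A = drift_matrix(h1, h2, a)
        print(a, slowest_decay_rate(A), lyapunov_residual(A, h1, h2))
```

With $a = 0$ the slowest rate is $\min(h_1, h_2)$; once $a$ is large enough for the eigenvalues to become complex, the rate rises to $(h_1 + h_2)/2$, while the Lyapunov residual stays at zero, which is the "same stationary distribution, faster convergence" statement restricted to this quadratic case.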
ML Feb 13, 2020
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
Umut Şimşekli, Lingjiong Zhu, Yee Whye Teh et al.
Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy-tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of the stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we call \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
OC Oct 19, 2019
Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks
Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar et al.
We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework which allows one to choose the stepsize and the momentum parameters of these algorithms in a way to optimize performance by systematically trading off the bias, variance, robustness to gradient noise and dependence on network effects. When gradients do not contain noise, we also prove that distributed accelerated methods can \emph{achieve acceleration}, requiring $\mathcal{O}(\sqrt{κ}\log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{κ}\log(1/\varepsilon))$ communications to converge to the same fixed point as the non-accelerated variant, where $κ$ is the condition number and $\varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result where the iteration complexity scales with the square root of the condition number in the context of \emph{primal} distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact $\mathcal{O}(\exp(-k/\sqrt{κ}))$ linear decay in the bias term as well as optimal $\mathcal{O}(σ^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in practical algorithms that are robust to gradient noise and that can outperform existing methods.
ML Jan 22, 2019
Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances
Bugra Can, Mert Gurbuzbalaban, Lingjiong Zhu
Momentum methods such as Polyak's heavy ball (HB) method, Nesterov's accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy estimates of the gradients are available. For strongly convex problems, we show that the distribution of the iterates of AG converges with the accelerated $O(\sqrt{κ}\log(1/\varepsilon))$ linear rate to a ball of radius $\varepsilon$ centered at a unique invariant distribution in the 1-Wasserstein metric where $κ$ is the condition number as long as the noise variance is smaller than an explicit upper bound we can provide. Our analysis also certifies linear convergence rates as a function of the stepsize, momentum parameter and the noise variance; recovering the accelerated rates in the noiseless case and quantifying the level of noise that can be tolerated to achieve a given performance. In the special case of strongly convex quadratic objectives, we can show accelerated linear rates in the $p$-Wasserstein metric for any $p\geq 1$ with improved sensitivity to noise for both AG and HB through a non-asymptotic analysis under some additional assumptions on the noise structure. Our analysis for HB and AG also leads to improved non-asymptotic convergence bounds in suboptimality for both deterministic and stochastic settings which is of independent interest. To the best of our knowledge, these are the first linear convergence results for stochastic momentum methods under the stochastic oracle model. We also extend our results to the APG method and weakly convex functions showing accelerated rates when the noise magnitude is sufficiently small.
OC Dec 19, 2018
Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization
Xuefeng Gao, Mert Gurbuzbalaban, Lingjiong Zhu
Langevin dynamics (LD) has been proven to be a powerful technique for optimizing a non-convex objective as an efficient algorithm to find local minima while eventually visiting a global minimum on longer time-scales. LD is based on the first-order Langevin diffusion which is reversible in time. We study two variants that are based on non-reversible Langevin diffusions: the underdamped Langevin dynamics (ULD) and the Langevin dynamics with a non-symmetric drift (NLD). Adapting the techniques of Tzen, Liang and Raginsky (2018) for LD to non-reversible diffusions, we show that for a given local minimum that is within an arbitrary distance from the initialization, with high probability, either the ULD trajectory ends up somewhere outside a small neighborhood of this local minimum within a recurrence time which depends on the smallest eigenvalue of the Hessian at the local minimum, or it enters this neighborhood by the recurrence time and stays there for a potentially exponentially long escape time. The ULD algorithms improve upon the recurrence time obtained for LD in Tzen, Liang and Raginsky (2018) with respect to the dependency on the smallest eigenvalue of the Hessian at the local minimum. A similar result and improvement are obtained for the NLD algorithm. We also show that non-reversible variants can exit the basin of attraction of a local minimum faster in discrete time when the objective has two local minima separated by a saddle point and quantify the amount of improvement. Our analysis suggests that non-reversible Langevin algorithms are more efficient at locating a local minimum as well as exploring the state space. Our analysis is based on the quadratic approximation of the objective around a local minimum. As a by-product of our analysis, we obtain optimal mixing rates for quadratic objectives in the 2-Wasserstein distance for the two non-reversible Langevin algorithms we consider.
OC Sep 12, 2018
Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration
Xuefeng Gao, Mert Gürbüzbalaban, Lingjiong Zhu
Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient with momentum where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. Many works have reported its empirical success in practice for solving stochastic non-convex optimization problems; in particular, it has been observed to outperform overdamped Langevin Monte Carlo-based methods such as stochastic gradient Langevin dynamics (SGLD) in many applications. Although asymptotic global convergence properties of SGHMC are well known, its finite-time performance is not well understood. In this work, we study two variants of SGHMC based on two alternative discretizations of the underdamped Langevin diffusion. We provide finite-time performance bounds for the global convergence of both SGHMC variants for solving stochastic non-convex optimization problems with explicit constants. Our results lead to non-asymptotic guarantees for both population and empirical risk minimization problems. For a fixed target accuracy level, on a class of non-convex problems, we obtain complexity bounds for SGHMC that can be tighter than those for SGLD. These results show that acceleration with momentum is possible in the context of global non-convex optimization.
MF Jul 4, 2017
Asymptotics for the Euler-Discretized Hull-White Stochastic Volatility Model
Dan Pirjol, Lingjiong Zhu
We consider the stochastic volatility model $dS_t = σ_t S_t dW_t$, $dσ_t = ωσ_t dZ_t$, with $(W_t,Z_t)$ uncorrelated standard Brownian motions. This is a special case of the Hull-White and the $β=1$ (log-normal) SABR model, which are widely used in financial practice. We study the properties of this model, discretized in time under several applications of the Euler-Maruyama scheme, and point out that the resulting model has certain properties which are different from those of the continuous time model. We study the asymptotics of the time-discretized model in the $n\to \infty$ limit of a very large number of time steps of size $τ$, at fixed $β=\frac{1}{2}ω^2 τ n^2$ and $ρ=σ_0^2 τ$, and derive three results: i) almost sure limits, ii) fluctuation results, and iii) explicit expressions for growth rates (Lyapunov exponents) of the positive integer moments of $S_t$. Under the Euler-Maruyama discretization for $(S_t,\log σ_t)$, the Lyapunov exponents have a phase transition, which appears in numerical simulations of the model as a numerical explosion of the asset price moments. We derive criteria for the appearance of these explosions.
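To make the discretization concrete, here is a minimal sketch of one reading of the Euler-Maruyama scheme for $(S_t, \log σ_t)$: by Itô's formula $d\log σ_t = -\frac12 ω^2 dt + ω\,dZ_t$, so each step of size $τ$ updates $S_{i+1} = S_i (1 + σ_i \sqrt{τ}\, ε_i)$ and $\log σ_{i+1} = \log σ_i - \frac12 ω^2 τ + ω \sqrt{τ}\, ξ_i$ with independent standard normals $ε_i, ξ_i$. The function names and parameter values are hypothetical, and this is not claimed to be the exact scheme analyzed in the paper.

```python
import math
import random


def euler_step(S, log_sigma, omega, tau, eps, xi):
    """One Euler-Maruyama step for (S, log sigma) given standard normal draws."""
    sigma = math.exp(log_sigma)
    # Euler step for dS = sigma * S * dW (S may become negative under Euler).
    S_next = S * (1.0 + sigma * math.sqrt(tau) * eps)
    # Euler step for d(log sigma) = -0.5 * omega^2 dt + omega dZ.
    log_sigma_next = log_sigma - 0.5 * omega ** 2 * tau + omega * math.sqrt(tau) * xi
    return S_next, log_sigma_next


def simulate(S0, sigma0, omega, tau, n, seed=0):
    """Simulate n steps and return the list [S_0, S_1, ..., S_n]."""
    rng = random.Random(seed)
    S, log_sigma = S0, math.log(sigma0)
    path = [S]
    for _ in range(n):
        eps, xi = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # uncorrelated noises
        S, log_sigma = euler_step(S, log_sigma, omega, tau, eps, xi)
        path.append(S)
    return path


if __name__ == "__main__":
    path = simulate(S0=1.0, sigma0=0.2, omega=0.5, tau=0.01, n=100)
    print(len(path), path[-1])
```

Averaging powers of the terminal value over many seeds is one way to look for the moment growth described in the abstract, although sample averages of heavy-tailed quantities converge slowly and should be read with care.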