OC LG ML · Jan 22, 2019

Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

Thanh Huy Nguyen, Umut Şimşekli, Gaël Richard

arXiv:1901.07487v1 14.131 citations

Originality: Incremental advance

AI Analysis

This work advances the theoretical understanding of heavy-tailed noise in optimization algorithms, a topic relevant to machine learning practitioners. It is incremental in that it extends prior asymptotic results to a non-asymptotic analysis.

The paper analyzes the non-asymptotic convergence of Fractional Langevin Monte Carlo (FLMC) for non-convex optimization. It proves finite-time bounds on the expected suboptimality and shows that the weak error of FLMC increases faster than that of LMC, which suggests using smaller step-sizes in FLMC.

Recent studies on diffusion-based sampling methods have shown that Langevin Monte Carlo (LMC) algorithms can be beneficial for non-convex optimization, and rigorous theoretical guarantees have been proven for both asymptotic and finite-time regimes. Algorithmically, LMC-based algorithms resemble the well-known gradient descent (GD) algorithm, where the GD recursion is perturbed by additive Gaussian noise whose variance has a particular form. Fractional Langevin Monte Carlo (FLMC) is a recently proposed extension of LMC, where the Gaussian noise is replaced by a heavy-tailed α-stable noise. Unlike their Gaussian counterparts, these heavy-tailed perturbations can produce large jumps, and it has been empirically demonstrated that the choice of α-stable noise can provide several advantages in modern machine learning problems, both in optimization and sampling contexts. However, as opposed to LMC, only asymptotic convergence properties of FLMC have been established so far. In this study, we analyze the non-asymptotic behavior of FLMC for non-convex optimization and prove finite-time bounds for its expected suboptimality. Our results show that the weak error of FLMC increases faster than that of LMC, which suggests using smaller step-sizes in FLMC. We finally extend our results to the case where the exact gradients are replaced by stochastic gradients and show that similar results hold in this setting as well.
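The relation between the two recursions described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's exact scheme: the drift is simplified to the plain gradient (the FLMC drift in the literature is defined differently), and the names `sample_symmetric_alpha_stable`, `noisy_gradient_step`, `grad_f`, the inverse temperature `beta`, and the noise scale `(eta / beta) ** (1 / alpha)` are assumptions made for illustration. The α-stable draws use the Chambers-Mallows-Stuck method. At α = 2 the standard symmetric stable law is Gaussian with variance 2, so the update reduces to an LMC-style step with noise standard deviation `sqrt(2 * eta / beta)`.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha, rng):
    """Draw one standard symmetric alpha-stable variate (Chambers-Mallows-Stuck)."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
    w = rng.expovariate(1.0)                    # unit-rate exponential
    if alpha == 1.0:
        return math.tan(u)                      # Cauchy special case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def noisy_gradient_step(x, grad, eta, beta, alpha, rng):
    """One gradient step perturbed by alpha-stable noise.

    Simplification (assumption of this sketch): the drift is the plain
    gradient. With alpha = 2 the noise is Gaussian with standard
    deviation sqrt(2 * eta / beta), i.e. an LMC-style update.
    """
    scale = (eta / beta) ** (1.0 / alpha)
    return [xi - eta * gi + scale * sample_symmetric_alpha_stable(alpha, rng)
            for xi, gi in zip(x, grad(x))]


def grad_f(x):
    """Gradient of the hypothetical non-convex objective
    f(x) = sum_i (0.5 * x_i**2 + 2 * cos(2 * x_i))."""
    return [xi - 4.0 * math.sin(2.0 * xi) for xi in x]


def run_chain(x0, eta, beta, alpha, n_steps, seed=0):
    """Iterate the noisy recursion from x0 and return the final iterate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        x = noisy_gradient_step(x, grad_f, eta, beta, alpha, rng)
    return x
```

For example, `run_chain([3.0], eta=0.01, beta=1.0, alpha=1.7, n_steps=1000)` runs a heavy-tailed chain, while `alpha=2.0` gives the Gaussian case. Comparing the two with progressively smaller `eta` is one way to explore the step-size sensitivity the abstract refers to.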
