PR LG OC Jan 18, 2024

Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks

arXiv:2401.09665v15.14 citations ICLR

Originality Highly original

AI Analysis

This work improves convergence speed for distributed optimization algorithms, which is incremental but impactful for large-scale machine learning systems.

The paper tackles the problem of slow convergence in distributed stochastic optimization by replacing the standard Markov chain token with a Self-Repellent Random Walk (SRRW), resulting in an O(1/α^2) decrease in asymptotic covariance for optimization errors.

We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a nonlinear Markov chain, namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar α, is less likely to transition to states that were highly visited in the past, hence the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves an O(1/α) decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/α^2), so the performance benefit of using SRRW is amplified in the stochastic optimization context. Empirical results support our theoretical findings.
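The token mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a self-repellent kernel of the form K_ij ∝ P_ij (x_j/μ_j)^(-α), where P is the base chain, μ its target distribution, and x the empirical visit distribution; this form is not stated in the abstract. It also uses the plain empirical distribution for x rather than the 'generalized' version the paper proposes. The ring graph, the quadratic local losses, the 1/t step size, and the names `ring_walk`, `srrw_step`, and `run_sa_srrw` are all hypothetical choices made for illustration.

```python
import numpy as np


def ring_walk(n):
    """Base chain (assumed): simple random walk on a ring of n nodes."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = 0.5
        P[i, (i + 1) % n] = 0.5
    mu = np.full(n, 1.0 / n)  # uniform stationary distribution of this walk
    return P, mu


def srrw_step(rng, P, mu, visits, state, alpha):
    """Draw the token's next node under the assumed self-repellent kernel."""
    x = visits / visits.sum()                   # empirical visit distribution
    weights = P[state] * (x / mu) ** (-alpha)   # penalize over-visited nodes
    probs = weights / weights.sum()
    return rng.choice(len(mu), p=probs)


def run_sa_srrw(P, mu, targets, alpha, n_steps, seed=0):
    """Token-driven stochastic approximation on a toy problem.

    Node i holds the local loss 0.5 * (theta - targets[i])**2, so the
    minimizer of the mu-weighted objective is sum_i mu[i] * targets[i].
    """
    rng = np.random.default_rng(seed)
    visits = np.ones(len(mu))  # one pseudo-visit per node keeps x > 0
    state, theta = 0, 0.0
    for t in range(1, n_steps + 1):
        state = srrw_step(rng, P, mu, visits, state, alpha)
        visits[state] += 1
        grad = theta - targets[state]  # local gradient at the visited node
        theta -= grad / t              # decreasing step size 1/t (assumed)
    return theta, visits / visits.sum()


if __name__ == "__main__":
    P, mu = ring_walk(5)
    targets = np.arange(5, dtype=float)
    for alpha in (0.0, 5.0):  # alpha = 0 recovers the base Markov chain
        theta, freq = run_sa_srrw(P, mu, targets, alpha, n_steps=5000)
        print(alpha, theta, freq)
```

Setting `alpha=0.0` makes the weights equal to the base row `P[state]`, so the same code runs the baseline token. Repeating both settings over many seeds and comparing the spread of `theta` gives an informal check of the variance ordering claimed in the abstract.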
