Stochastic Gradient MCMC with Stale Gradients
This work addresses the challenge of efficient Bayesian inference in distributed settings, providing theoretical analysis and empirical validation for scalable SG-MCMC with stale gradients. The contribution is incremental in that it builds on existing SG-MCMC methods.
The paper studies the use of stale gradients in stochastic gradient MCMC (SG-MCMC) for large-scale Bayesian learning. It shows that the bias and MSE depend on the staleness, whereas the estimation variance does not, which implies that in a distributed system the estimation variance decreases with a linear speedup in the number of workers.
Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.
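To make the notion of a stale gradient concrete, the following is a minimal single-process sketch, not the paper's distributed implementation. It assumes stochastic gradient Langevin dynamics (SGLD) as the SG-MCMC instance, a toy model with a Gaussian likelihood of unit variance and a zero-mean Gaussian prior on a scalar mean, and a fixed staleness `tau` simulated by evaluating each gradient at the parameter from `tau` steps earlier. The function name `sgld_stale` and all of its parameters are hypothetical choices made for illustration.

```python
from collections import deque

import numpy as np


def sgld_stale(data, n_steps=5000, step=1e-4, batch=100, tau=4,
               prior_var=10.0, seed=0):
    """SGLD on the mean of a unit-variance Gaussian, using stale gradients.

    Illustrative assumption: staleness is a fixed delay of `tau` steps,
    simulated in one process instead of by asynchronous workers.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    theta = 0.0
    # Keep the last tau + 1 parameter values; history[0] is the stalest one.
    history = deque([theta] * (tau + 1), maxlen=tau + 1)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        stale_theta = history[0]  # parameter from tau steps ago
        idx = rng.integers(0, n, size=batch)  # minibatch indices
        # Stochastic gradient of the negative log posterior at the STALE
        # parameter: prior term plus rescaled minibatch likelihood term.
        grad_u = (stale_theta / prior_var
                  - (n / batch) * np.sum(data[idx] - stale_theta))
        # Langevin update applied to the CURRENT parameter.
        theta = theta - step * grad_u + np.sqrt(2.0 * step) * rng.standard_normal()
        history.append(theta)  # the oldest entry is dropped automatically
        samples[t] = theta
    return samples
```

Setting `tau=0` recovers plain SGLD, so running the sketch with several values of `tau` on the same data gives a rough way to see how staleness affects the sample average. The step size has to be small enough that the delayed update remains stable; the defaults were chosen with roughly 1000 data points in mind.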