Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization
This work addresses the gap between Bayesian and optimization methods for machine learning practitioners, though it appears incremental as it builds on existing SG-MCMC techniques.
The authors connect stochastic gradient MCMC (SG-MCMC) with stochastic optimization by applying simulated annealing to an SG-MCMC algorithm and extending it with adaptive preconditioners and adaptive element-wise momentum weights. The zero-temperature limit yields a novel optimization method that achieves state-of-the-art results on several deep neural network models compared to related stochastic optimization algorithms.
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied. We explore this relationship by applying simulated annealing to an SG-MCMC algorithm. Furthermore, we extend recent SG-MCMC methods with two key components: i) adaptive preconditioners (as in Adagrad or RMSprop), and ii) adaptive element-wise momentum weights. The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight. Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optimum. Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.
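To make the annealing idea concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes a simpler stand-in: stochastic gradient Langevin dynamics with an RMSprop-style diagonal preconditioner, where the injected noise is scaled by a temperature that decays over time. It omits the adaptive element-wise momentum weights entirely. The function name `annealed_psgld`, the polynomial schedule `inv_temp = t ** anneal_rate`, and all default values are hypothetical choices made for illustration.

```python
import numpy as np

def annealed_psgld(grad_fn, theta0, n_steps=2000, step_size=1e-2,
                   decay=0.99, eps=1e-3, anneal_rate=1.0, seed=0):
    """Annealed, preconditioned SGLD sketch (illustrative, not the paper's method).

    grad_fn(theta, rng) must return a stochastic gradient of the objective.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)  # running average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(theta, rng)
        # RMSprop-style diagonal preconditioner (assumed form).
        v = decay * v + (1.0 - decay) * g * g
        precond = 1.0 / (np.sqrt(v) + eps)
        # Inverse temperature grows with t, so the temperature tends to zero.
        inv_temp = float(t) ** anneal_rate
        noise = rng.standard_normal(theta.shape)
        # Langevin step: preconditioned gradient descent plus scaled Gaussian noise.
        theta = (theta
                 - step_size * precond * g
                 + np.sqrt(2.0 * step_size * precond / inv_temp) * noise)
    return theta

# Toy usage: noisy gradients of 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])

def noisy_grad(theta, rng):
    return (theta - target) + 0.1 * rng.standard_normal(theta.shape)

estimate = annealed_psgld(noisy_grad, np.zeros(2))
print(estimate)
```

As the temperature shrinks, the noise term vanishes and each update reduces to a plain RMSprop-like step, which mirrors in a simplified setting how a zero-temperature limit turns a sampler into an optimizer.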