ML LG Jul 21, 2021

Differentiable Annealed Importance Sampling and the Perils of Gradient Noise

Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, Roger Grosse

arXiv:2107.10211v2 20.8 44 citations

Originality: Incremental advance

AI Analysis

This paper addresses scalable Bayesian inference for large datasets and identifies a critical limitation in the stochastic (mini-batch) setting. The progress is incremental, but the analysis is novel.

The authors make annealed importance sampling (AIS) differentiable so that marginal likelihood can be optimized with gradient-based methods. Their proposal, Differentiable AIS (DAIS), removes the Metropolis-Hastings corrections. In their Bayesian linear regression analysis, DAIS is provably consistent in the full-batch setting. A stochastic variant that uses mini-batch gradients, however, can be arbitrarily bad, because last-iterate convergence to the posterior and elimination of the accumulated stochastic error are fundamentally incompatible goals.

Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation, but are not fully differentiable due to the use of Metropolis-Hastings correction steps. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective using gradient-based methods. To this end, we propose Differentiable AIS (DAIS), a variant of AIS which ensures differentiability by abandoning the Metropolis-Hastings corrections. As a further advantage, DAIS allows for mini-batch gradients. We provide a detailed convergence analysis for Bayesian linear regression which goes beyond previous analyses by explicitly accounting for the sampler not having reached equilibrium. Using this analysis, we prove that DAIS is consistent in the full-batch setting and provide a sublinear convergence rate. Furthermore, motivated by the problem of learning from large-scale datasets, we study a stochastic variant of DAIS that uses mini-batch gradients. Surprisingly, stochastic DAIS can be arbitrarily bad due to a fundamental incompatibility between the goals of last-iterate convergence to the posterior and elimination of the accumulated stochastic error. This is in stark contrast with other settings such as gradient-based optimization and Langevin dynamics, where the effect of gradient noise can be washed out by taking smaller steps. This indicates that annealing-based marginal likelihood estimation with stochastic gradients may require new ideas.
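To make the idea of annealing without Metropolis-Hastings corrections concrete, here is a minimal NumPy sketch. It is not the authors' algorithm. It swaps in unadjusted Langevin transitions with a matching backward kernel, and it uses a one-dimensional conjugate Gaussian model so that the exact log marginal likelihood is available for comparison. The model, the function names (`log_normal`, `ula_ais_log_weights`), and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x (elementwise)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def ula_ais_log_weights(x_obs, noise_var=0.5, n_steps=200, step=0.05,
                        n_chains=2000, seed=0):
    """Log importance weights from annealing with unadjusted Langevin moves.

    Assumed toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, noise_var).
    Annealing path: log gamma_beta(z) = log p(z) + beta * log p(x_obs | z).
    No accept/reject step is used, so every operation is differentiable.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_steps + 1)

    def grad_log_gamma(z, beta):
        # Gradient in z of log N(z; 0, 1) + beta * log N(x_obs; z, noise_var).
        return -z + beta * (x_obs - z) / noise_var

    z = rng.standard_normal(n_chains)      # z_0 drawn from the prior
    log_w = -log_normal(z, 0.0, 1.0)       # subtract log q_0(z_0)
    for beta in betas[1:]:
        # Forward kernel: one unadjusted Langevin step targeting gamma_beta.
        fwd_mean = z + step * grad_log_gamma(z, beta)
        z_new = fwd_mean + np.sqrt(2.0 * step) * rng.standard_normal(n_chains)
        # Backward kernel: the same Langevin step taken from z_new.
        bwd_mean = z_new + step * grad_log_gamma(z_new, beta)
        log_w += (log_normal(z, bwd_mean, 2.0 * step)
                  - log_normal(z_new, fwd_mean, 2.0 * step))
        z = z_new
    # Add the unnormalized final target log p(z_K) + log p(x_obs | z_K).
    log_w += log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, noise_var)
    return log_w


x_obs, noise_var = 1.5, 0.5
log_w = ula_ais_log_weights(x_obs, noise_var=noise_var)

# Numerically stable log-mean-exp of the weights estimates log p(x_obs).
shift = log_w.max()
log_z_hat = shift + np.log(np.mean(np.exp(log_w - shift)))

# Exact answer for this conjugate model: x ~ N(0, 1 + noise_var).
log_z_true = log_normal(x_obs, 0.0, 1.0 + noise_var)
print(f"estimate: {log_z_hat:.3f}  exact: {log_z_true:.3f}")
```

Because the backward kernel is a normalized density, the weights are unbiased for the marginal likelihood at any step size, so the average log weight is a lower bound on the log marginal likelihood in expectation. The sketch uses exact (full-batch) gradients only. It does not reproduce the stochastic-gradient failure mode analyzed in the paper.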
