ML · DIS-NN · LG · Apr 8, 2019

Bayesian Neural Networks at Finite Temperature

arXiv:1904.04154v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work targets generalization in neural network classifiers, offering practitioners an incremental improvement through finite-temperature sampling and a sampling-based approach to model selection.

The paper tackles generalization in Bayesian neural network classifiers. It shows that sampling from the posterior gives lower test error than standard optimization of the cost function, and that sampling from finite-temperature ($T$) distributions derived from the posterior can do better still. For two deep MNIST classifiers the appropriate $T$ differs markedly, and a typical classifier shows a clear test-error minimum at $T>0$. The paper also proposes an early stopping criterion for full batch simulated annealing and uses thermodynamic integration for model selection, which avoids computing and inverting the Hessian.
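As a rough sketch of the quantities involved, assume the common convention in which the cost $C(\theta)$ is the negative log of the unnormalised posterior over parameters $\theta$ given data $D$. The paper's own definitions may differ, for example in whether the prior term is also tempered. The symbols $C$, $Z$, $\beta$ and $\beta_0$ are notation introduced here for illustration. Under that assumption, the finite-temperature distributions and the thermodynamic-integration identity can be written as:

$$
\begin{aligned}
C(\theta) &= -\log p(D \mid \theta) - \log p(\theta), \\
p_T(\theta) &= \frac{1}{Z(\beta)}\, e^{-\beta C(\theta)}, \qquad \beta = 1/T, \qquad Z(\beta) = \int e^{-\beta C(\theta)}\, d\theta, \\
\frac{d}{d\beta} \log Z(\beta) &= -\langle C \rangle_\beta
\;\;\Longrightarrow\;\;
\log Z(1) = \log Z(\beta_0) - \int_{\beta_0}^{1} \langle C \rangle_\beta \, d\beta .
\end{aligned}
$$

Under this convention $T=1$ recovers the posterior, $T \to 0$ concentrates on the minima of $C$ (standard optimisation), and $Z(1) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ is the model evidence used for model selection. Here $\langle \cdot \rangle_\beta$ is an average over samples drawn at inverse temperature $\beta$, and $\beta_0$ is a reference inverse temperature at which $\log Z$ is assumed to be known or tractable.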

We recapitulate the Bayesian formulation of neural network based classifiers and show that, while sampling from the posterior does indeed lead to better generalisation than is obtained by standard optimisation of the cost function, even better performance can in general be achieved by sampling finite temperature ($T$) distributions derived from the posterior. Taking the example of two different deep (3 hidden layers) classifiers for MNIST data, we find quite different $T$ values to be appropriate in each case. In particular, for a typical neural network classifier a clear minimum of the test error is observed at $T>0$. This suggests an early stopping criterion for full batch simulated annealing: cool until the average validation error starts to increase, then revert to the parameters with the lowest validation error. As $T$ is increased classifiers transition from accurate classifiers to classifiers that have higher training error than assigning equal probability to each class. Efficient studies of these temperature-induced effects are enabled using a replica-exchange Hamiltonian Monte Carlo simulation technique. Finally, we show how thermodynamic integration can be used to perform model selection for deep neural networks. Similar to the Laplace approximation, this approach assumes that the posterior is dominated by a single mode. Crucially, however, no assumption is made about the shape of that mode and it is not required to precisely compute and invert the Hessian.
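The early stopping criterion stated in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `anneal_with_early_stopping`, the callables `sample_at` and `val_error`, and the toy stand-ins at the bottom are all hypothetical. The sampler is left abstract and could, for instance, be a Hamiltonian Monte Carlo routine.

```python
import math


def anneal_with_early_stopping(theta0, temperatures, sample_at, val_error):
    """Cool through `temperatures` (assumed decreasing) and stop early.

    sample_at(theta, T) -> non-empty list of parameter samples at temperature T
    val_error(theta)    -> validation error of one parameter setting

    Stops as soon as the average validation error at the current temperature
    exceeds the average at the previous one, then returns the parameters with
    the lowest validation error seen so far.
    """
    best_theta, best_err = theta0, val_error(theta0)
    prev_mean = math.inf
    theta = theta0
    for T in temperatures:
        samples = sample_at(theta, T)
        errors = [val_error(s) for s in samples]
        # Track the single best parameter setting encountered.
        for s, e in zip(samples, errors):
            if e < best_err:
                best_theta, best_err = s, e
        mean_err = sum(errors) / len(errors)
        if mean_err > prev_mean:
            # Average validation error started to increase: revert and stop.
            return best_theta, best_err, T
        prev_mean = mean_err
        theta = samples[-1]  # continue the chain from the last sample
    return best_theta, best_err, temperatures[-1]


# Hypothetical deterministic stand-ins, used only to exercise the control flow.
def toy_sample_at(theta, T):
    # Pretend the "sample" at temperature T is just the value T itself.
    return [T]


def toy_val_error(theta):
    # Pretend the validation error is minimised at theta = 0.5.
    return (theta - 0.5) ** 2
```

With the toy stand-ins, `anneal_with_early_stopping(3.0, [2.0, 1.0, 0.5, 0.25, 0.1], toy_sample_at, toy_val_error)` stops at the first temperature where the average validation error rises and returns the best parameters seen before that point. In a real setting the comparison would probably need some smoothing or patience, since sampled validation errors are noisy.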

Code Implementations: 1 repo
