On Local Posterior Structure in Deep Ensembles
This work studies calibration and uncertainty quantification in machine learning and reveals a counterintuitive trade-off between in-distribution and out-of-distribution performance that is relevant to practitioners using ensemble methods.
The paper investigates deep ensembles of Bayesian Neural Networks (DE-BNNs) and finds that, contrary to expectations, large deep ensembles (DEs) consistently outperform DE-BNNs on in-distribution data, while DE-BNNs show better out-of-distribution performance at the cost of in-distribution accuracy.
Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimates such as the maximum a posteriori (MAP) estimate. Deep ensembles (DEs) are likewise known to improve calibration, so it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this hypothesis across a number of datasets, neural network architectures, and BNN approximation methods. Surprisingly, we find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shed light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
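To make the comparison concrete, the following is a minimal sketch of how the two predictive distributions could be formed for a classification task. It assumes the standard construction: a DE averages the softmax outputs of M independently trained MAP networks, while a DE-BNN additionally averages over S samples drawn from an approximate posterior around each member. The array shapes, the function names, and the use of expected calibration error (ECE) as the calibration metric are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def de_predictive(member_logits):
    """Deep ensemble predictive (assumed form).

    member_logits: array of shape (M, N, C), one set of logits per MAP member.
    Returns class probabilities of shape (N, C).
    """
    return softmax(member_logits).mean(axis=0)


def de_bnn_predictive(sample_logits):
    """Deep ensemble of BNNs predictive (assumed form).

    sample_logits: array of shape (M, S, N, C), with S posterior samples
    per ensemble member. Averages over members and samples uniformly.
    """
    return softmax(sample_logits).mean(axis=(0, 1))


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE; a hypothetical choice of calibration metric."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


if __name__ == "__main__":
    # Synthetic logits only; these numbers say nothing about real models.
    rng = np.random.default_rng(0)
    M, S, N, C = 5, 10, 200, 3
    labels = rng.integers(0, C, size=N)
    de_probs = de_predictive(rng.normal(size=(M, N, C)))
    de_bnn_probs = de_bnn_predictive(rng.normal(size=(M, S, N, C)))
    print("DE ECE:", expected_calibration_error(de_probs, labels))
    print("DE-BNN ECE:", expected_calibration_error(de_bnn_probs, labels))
```

With S = 1 and each member's single sample equal to its MAP logits, `de_bnn_predictive` reduces to `de_predictive`. Under these assumptions, the two methods differ only in the extra averaging over local posterior samples around each member.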