On the Importance of Strong Baselines in Bayesian Deep Learning
This paper exposes an experimental oversight in Bayesian deep learning and emphasizes the need for consistent benchmarking to ensure fair comparisons. The contribution is incremental but important for the field's rigor.
The paper identifies a flaw in Bayesian deep learning experiments: models trained to convergence are compared with baselines trained only for a fixed number of iterations. It shows that Monte Carlo dropout, when evaluated under identical settings, improves significantly and matches or outperforms methods previously claimed to be superior.
Like all sub-fields of machine learning, Bayesian Deep Learning is driven by empirical validation of its theoretical proposals. Given the many aspects of an experiment, it is always possible for minor or even major experimental flaws to slip by both authors and reviewers. One of the most popular experiments used to evaluate approximate inference techniques is the regression experiment on UCI datasets. However, in this experiment, models which have been trained to convergence have often been compared with baselines trained only for a fixed number of iterations. We find that a well-established baseline, Monte Carlo dropout, shows significant improvements when evaluated under the same experimental settings. In fact, the baseline outperforms or performs competitively with methods that claimed to be superior to the very same baseline method when they were introduced. Hence, by exposing this flaw in experimental procedure, we highlight the importance of using identical experimental setups to evaluate, compare, and benchmark methods in Bayesian Deep Learning.
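To make the evaluation concrete, here is a minimal sketch of how test metrics for a Monte Carlo dropout regressor are commonly computed from T stochastic forward passes. It is not taken from the paper's code. The function names `logsumexp` and `mc_dropout_metrics`, the nested-list input layout, and the precision hyperparameter `tau` are assumptions made for illustration. The log-likelihood is the usual Monte Carlo estimate: the log of the average Gaussian likelihood over the sampled predictions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mc_dropout_metrics(samples, targets, tau):
    """Return (RMSE, mean test log-likelihood) from T stochastic passes.

    samples: list of T lists, each holding N predictions from one forward
             pass with dropout kept active (hypothetical input layout)
    targets: list of N true regression targets
    tau:     assumed observation-noise precision (a tuned hyperparameter)
    """
    T, N = len(samples), len(targets)
    sq_err_sum = 0.0
    ll_sum = 0.0
    for i, y in enumerate(targets):
        preds = [samples[t][i] for t in range(T)]
        # Predictive mean: average of the stochastic predictions.
        mean = sum(preds) / T
        sq_err_sum += (y - mean) ** 2
        # Log of the Monte Carlo average of N(y | pred_t, 1/tau).
        ll_sum += (
            logsumexp([-0.5 * tau * (y - p) ** 2 for p in preds])
            - math.log(T)
            - 0.5 * math.log(2 * math.pi)
            + 0.5 * math.log(tau)
        )
    return math.sqrt(sq_err_sum / N), ll_sum / N
```

For a comparison to be fair in the sense argued above, every method would be scored with the same function on the same train/test splits, after the same training protocol (for example, all trained to convergence rather than some for a fixed number of iterations).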