LG · Apr 25

Unstable Rankings in Bayesian Deep Learning Evaluation

Qishi Zhan, Minxuan Hu, Guansu Wang, Jiaxin Liu, Liang He

arXiv:2604.23102 · 42.3

Predicted impact: top 60% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners evaluating Bayesian deep learning methods, this work highlights the need for uncertainty-aware evaluation in low-data settings to avoid misleading conclusions.

The paper shows that standard Bayesian deep learning evaluations produce unreliable method rankings under data scarcity, with dataset-dependent behavior and no universal sample size threshold. It proposes a Bayesian hierarchical model with method-specific variances and a predictive Minimum Detectable Difference curve to assess whether evaluation data is sufficient before concluding method superiority.

Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.
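As a rough illustration of the kind of posterior comparison behind a quantity such as $P(\mathrm{MCD} \prec \mathrm{Ensemble})$, the sketch below estimates the posterior probability that one method's mean metric is lower than another's. It is not the paper's hierarchical model. It is a simplified, non-hierarchical stand-in that assumes each method's scores across data realizations are independent normal draws with a method-specific variance, a noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, and a metric where lower is better. The function name `prob_lower_mean` and the example scores are hypothetical.

```python
import numpy as np


def prob_lower_mean(scores_a, scores_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(mu_a < mu_b | data).

    Assumption (not from the paper): each method's scores are i.i.d. normal
    with their own mean and variance, under the prior p(mu, sigma^2) ∝ 1/sigma^2.
    """
    rng = np.random.default_rng(seed)

    def draw_posterior_mean(scores):
        y = np.asarray(scores, dtype=float)
        k = y.size
        if k < 2:
            raise ValueError("need at least two scores per method")
        # Under this prior, the marginal posterior of mu is Student-t with
        # k - 1 degrees of freedom, location mean(y), and scale s / sqrt(k).
        scale = y.std(ddof=1) / np.sqrt(k)
        return y.mean() + scale * rng.standard_t(k - 1, size=n_draws)

    mu_a = draw_posterior_mean(scores_a)
    mu_b = draw_posterior_mean(scores_b)
    # Fraction of joint posterior draws in which method A has the lower mean.
    return float(np.mean(mu_a < mu_b))


# Made-up scores for two methods over five data realizations (illustrative only).
method_a = [0.50, 0.52, 0.49, 0.51, 0.50]
method_b = [0.80, 0.79, 0.82, 0.81, 0.80]
print(prob_lower_mean(method_a, method_b))
```

With few realizations or noisy scores, the Student-t posteriors widen and the estimated probability moves toward 0.5, which is the behavior that makes a point-estimate ranking hard to trust at small $n$.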
