Statistical comparison of classifiers through Bayesian hierarchical modelling
This work addresses the need for more robust statistical comparison of classifiers in machine learning. It offers researchers and practitioners a method that improves accuracy estimation, although the contribution is incremental in that it builds on existing Bayesian approaches.
The authors address the problem of comparing classifier accuracies by proposing a Bayesian hierarchical model that overcomes shortcomings of traditional null hypothesis significance tests. The model provides the posterior probability that two classifiers are practically equivalent or significantly different, and it reduces estimation error by jointly analyzing results across multiple data sets.
The accuracy of two competing classifiers is usually compared via null hypothesis significance testing (NHST). Yet NHST suffers from important shortcomings, which can be overcome by switching to Bayesian hypothesis testing. We propose a Bayesian hierarchical model which jointly analyzes the cross-validation results obtained by two classifiers on multiple data sets. It returns the posterior probability of the accuracies of the two classifiers being practically equivalent or significantly different. A further strength of the hierarchical model is that, by jointly analyzing the results obtained on all data sets, it reduces the estimation error compared to the usual approach of averaging the cross-validation results obtained on a given data set.
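The two ideas in the abstract, joint estimation across data sets and posterior probabilities of practical equivalence, can be illustrated with a minimal sketch. This is not the hierarchical model proposed here. It is a toy normal-normal model with empirical-Bayes hyperparameters, and every name in it (`shrink_toward_grand_mean`, `region_probabilities`, the `rope` half-width of 0.01, and all numbers) is a hypothetical choice made for illustration. It also assumes the posterior of the mean difference is approximately normal.

```python
from statistics import NormalDist, fmean, variance


def shrink_toward_grand_mean(mean_diffs, std_errors):
    """Toy partial pooling under a normal-normal model (empirical Bayes).

    mean_diffs: per-data-set mean accuracy difference (classifier A minus B).
    std_errors: standard error of each mean difference, assumed known here.
    """
    grand_mean = fmean(mean_diffs)
    # Method-of-moments estimate of the between-data-set variance, floored at 0.
    within = fmean([se ** 2 for se in std_errors])
    tau2 = max(0.0, variance(mean_diffs) - within)
    shrunk = []
    for x, se in zip(mean_diffs, std_errors):
        # Weight on the data set's own mean; noisier estimates are shrunk more.
        w = tau2 / (tau2 + se ** 2) if tau2 > 0 else 0.0
        shrunk.append(w * x + (1.0 - w) * grand_mean)
    return shrunk


def region_probabilities(post_mean, post_sd, rope=0.01):
    """Posterior mass left of, inside, and right of the interval [-rope, rope].

    Assumes a normal posterior for the mean difference (A minus B).
    """
    post = NormalDist(post_mean, post_sd)
    p_left = post.cdf(-rope)                    # B practically better than A
    p_rope = post.cdf(rope) - post.cdf(-rope)   # practically equivalent
    p_right = 1.0 - post.cdf(rope)              # A practically better than B
    return p_left, p_rope, p_right


# Hypothetical cross-validation summaries on five data sets.
mean_diffs = [0.03, -0.01, 0.08, 0.00, 0.02]
std_errors = [0.02, 0.02, 0.03, 0.01, 0.02]
shrunk = shrink_toward_grand_mean(mean_diffs, std_errors)

# Hypothetical posterior summary for the average difference across data sets.
p_left, p_rope, p_right = region_probabilities(post_mean=0.024, post_sd=0.015)

print(shrunk)
print(p_left, p_rope, p_right)
```

Each shrunken estimate lies between the data set's own mean difference and the grand mean, which is the mechanism by which joint analysis can reduce estimation error relative to per-data-set averaging. The three probabilities sum to one, so a decision can be read directly from whichever region holds most of the posterior mass.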