Classifier uncertainty: evidence, potential impact, and probabilistic treatment
This work addresses unreliable performance evaluation in machine learning, a concern for both researchers and practitioners, though the contribution is incremental because it builds on existing probabilistic models.
The paper tackles the uncertainty that small test datasets introduce into classifier performance metrics. It presents a probabilistic method that quantifies this uncertainty from the confusion matrix alone, and it shows that the uncertainties can be large and that some published classifiers are likely to be misleading.
Classifiers are often tested on relatively small data sets, which should lead to uncertain performance metrics. Nevertheless, these metrics are usually taken at face value. We present an approach to quantify the uncertainty of classification performance metrics, based on a probability model of the confusion matrix. Application of our approach to classifiers from the scientific literature and a classification competition shows that uncertainties can be surprisingly large and limit performance evaluation. In fact, some published classifiers are likely to be misleading. The application of our approach is simple and requires only the confusion matrix. It is agnostic of the underlying classifier. Our method can also be used for the estimation of sample sizes that achieve a desired precision of a performance metric.
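To illustrate how uncertainty in a metric can be derived from the confusion matrix alone, the following is a minimal sketch and not necessarily the model used in the paper. It assumes a multinomial likelihood for the four cells of a binary confusion matrix with a flat Dirichlet prior, and it reports a credible interval for accuracy by Monte Carlo sampling. The function names, the prior, and the example counts are all illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def accuracy_interval(tp, fn, fp, tn, n_samples=20000, level=0.95, seed=0):
    """Equal-tailed credible interval for accuracy from confusion matrix counts.

    Assumption: the counts are multinomial with a flat Dirichlet(1, 1, 1, 1)
    prior, so the posterior over cell probabilities is Dirichlet(counts + 1).
    """
    rng = random.Random(seed)
    alpha = [tp + 1, fn + 1, fp + 1, tn + 1]
    accuracies = []
    for _ in range(n_samples):
        p_tp, p_fn, p_fp, p_tn = sample_dirichlet(alpha, rng)
        # Accuracy is the total probability mass on the correct cells.
        accuracies.append(p_tp + p_tn)
    accuracies.sort()
    lower_index = int((1 - level) / 2 * n_samples)
    upper_index = int((1 + level) / 2 * n_samples) - 1
    return accuracies[lower_index], accuracies[upper_index]


if __name__ == "__main__":
    # Hypothetical test set of 40 samples with observed accuracy 35/40.
    low, high = accuracy_interval(tp=18, fn=2, fp=3, tn=17)
    print(f"95% credible interval for accuracy: [{low:.3f}, {high:.3f}]")
```

Under the same assumptions, rerunning the function with proportionally larger hypothetical counts gives a rough sense of the test set size needed for a desired interval width. Other metrics can be handled by replacing the accuracy expression with the corresponding function of the sampled cell probabilities.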