LG ME | Dec 19, 2023

The curious case of the test set AUROC

Michael Roberts, Alon Hazan, Sören Dittmer, James H. F. Rudd, Carola-Bibiane Schönlieb

arXiv:2312.16188v12.09 citations | h-index: 49 | Has Code | Nat Mach Intell

Originality: Synthesis-oriented

AI Analysis

This paper addresses a methodological gap in ML evaluation, but its contribution is incremental: it critiques existing practice without proposing a new solution.

The paper argues that using AUROC or sensitivity/specificity from test data alone provides limited insight into model performance and generalization, despite the growth in model complexity.

Whilst the size and complexity of ML models have rapidly and significantly increased over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many potential performance metrics, the ML community stubbornly continues to use (a) the area under the receiver operating characteristic curve (AUROC) for a validation and test cohort (distinct from training data) or (b) the sensitivity and specificity for the test data at an optimal threshold determined from the validation ROC. However, we argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.
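To make the two practices in the abstract concrete, the following is a minimal sketch of (a) computing AUROC on a validation and a test cohort and (b) fixing a threshold on the validation ROC and reporting test sensitivity and specificity at it. The abstract does not say how the "optimal" threshold is chosen; Youden's J statistic is assumed here as one common choice. The function names and the toy labels and scores are hypothetical and are not taken from the paper.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(labels, scores):
    """Assumed notion of 'optimal': the observed score that maximises
    Youden's J = sensitivity + specificity - 1."""
    return max(sorted(set(scores)),
               key=lambda t: sum(sens_spec(labels, scores, t)) - 1)


# Toy validation and test cohorts (invented for illustration only).
val_y = [0, 0, 0, 1, 1, 0, 1, 1]
val_s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
test_y = [0, 1, 0, 0, 1, 1]
test_s = [0.2, 0.3, 0.45, 0.5, 0.7, 0.95]

# (a) AUROC on each cohort.
val_auc = auroc(val_y, val_s)
test_auc = auroc(test_y, test_s)

# (b) Threshold fixed on validation data, then applied unchanged to test data.
threshold = youden_threshold(val_y, val_s)
test_sens, test_spec = sens_spec(test_y, test_s, threshold)

print(f"validation AUROC={val_auc:.3f}, test AUROC={test_auc:.3f}")
print(f"threshold={threshold}, test sensitivity={test_sens:.3f}, "
      f"test specificity={test_spec:.3f}")
```

On this toy data the validation and test numbers differ noticeably, which is the kind of gap that a single test-set score reported in isolation would not reveal.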
