LGMEDec 19, 2023

The curious case of the test set AUROC

arXiv:2312.16188v19 citationsh-index: 49Nat Mach Intell
Originality Synthesis-oriented
AI Analysis

This addresses a methodological gap in ML evaluation, but it is incremental as it critiques existing practices without proposing a new solution.

The paper argues that using AUROC or sensitivity/specificity from test data alone provides limited insight into model performance and generalization, despite the growth in model complexity.

Whilst the size and complexity of ML models have rapidly and significantly increased over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many potential performance metrics, the ML community stubbornly continues to use (a) the area under the receiver operating characteristic curve (AUROC) for a validation and test cohort (distinct from training data) or (b) the sensitivity and specificity for the test data at an optimal threshold determined from the validation ROC. However, we argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes