ML LG ME · May 11, 2025

Outperformance Score: A Universal Standardization Method for Confusion-Matrix-Based Classification Performance Metrics

Ningsheng Zhao, Trang Bui, Jia Yuan Yu, Krzysztof Dzieciolowski

arXiv:2505.07033v14.51 citations · h-index: 2

Originality: Incremental advance

AI Analysis

The outperformance score gives researchers and practitioners a consistent way to interpret and compare classification performance when test-set imbalance rates differ. The advance is incremental: it standardizes existing metrics rather than introducing new ones.

The paper addresses the difficulty of comparing classification performance metrics across test sets with different class imbalance rates. It introduces the outperformance score, a universal standardization method that maps any confusion-matrix-based metric to the common scale [0,1], and demonstrates on real-world datasets that the resulting scores support meaningful comparisons.

Many classification performance metrics exist, each suited to a specific application. However, these metrics often differ in scale and can exhibit varying sensitivity to class imbalance rates in the test set. As a result, it is difficult to use the nominal values of these metrics to interpret and evaluate classification performances, especially when imbalance rates vary. To address this problem, we introduce the outperformance score function, a universal standardization method for confusion-matrix-based classification performance (CMBCP) metrics. It maps any given metric to a common scale of $[0,1]$, while providing a clear and consistent interpretation. Specifically, the outperformance score represents the percentile rank of the observed classification performance within a reference distribution of possible performances. This unified framework enables meaningful comparison and monitoring of classification performance across test sets with differing imbalance rates. We illustrate how the outperformance scores can be applied to a variety of commonly used classification performance metrics and demonstrate the robustness of our method through experiments on real-world datasets spanning multiple classification applications.
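The percentile-rank idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the reference distribution is built, so this example assumes one for illustration only: hypothetical classifiers whose true-positive and false-positive rates are drawn uniformly at random, evaluated at the same class counts as the test set. The function names (`sample_reference`, `outperformance_score`) and the choice of F1 as the metric are also assumptions, not the authors' implementation.

```python
import random


def f1_score(tp, fp, fn, tn):
    """F1 computed from confusion-matrix counts (tn is unused but kept for a uniform signature)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sample_reference(metric, n_pos, n_neg, n_samples=10_000, seed=0):
    """Draw metric values for hypothetical classifiers on a test set with
    n_pos positives and n_neg negatives.

    ASSUMPTION: TPR and FPR are sampled uniformly on [0, 1]. The paper's
    actual reference distribution may differ.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        tp = round(rng.random() * n_pos)  # true positives implied by a random TPR
        fp = round(rng.random() * n_neg)  # false positives implied by a random FPR
        values.append(metric(tp, fp, n_pos - tp, n_neg - fp))
    return values


def outperformance_score(observed, reference):
    """Percentile rank: the fraction of reference values that the observed
    value matches or exceeds. Always lies in [0, 1]."""
    return sum(v <= observed for v in reference) / len(reference)


# Hypothetical imbalanced test set: 100 positives, 900 negatives.
tp, fp, fn, tn = 80, 30, 20, 870
observed_f1 = f1_score(tp, fp, fn, tn)
reference = sample_reference(f1_score, n_pos=tp + fn, n_neg=fp + tn)
score = outperformance_score(observed_f1, reference)
print(f"F1 = {observed_f1:.3f}, outperformance score = {score:.3f}")
```

Because the reference values are generated at the test set's own class counts, the same nominal F1 can yield different scores under different imbalance rates, which is the kind of comparison the abstract describes.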
