LG | DATA-AN | Feb 21, 2023

Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers

arXiv:2302.12006v1 | 7 citations | h-index: 9
Originality: Highly original
AI Analysis

This work addresses a foundational problem in machine-learning evaluation and proposes grounding classifier assessment in decision-theoretic principles to make it more reliable.

The paper argues that current evaluation metrics for machine-learning classifiers are flawed because they are not grounded in decision theory, leading to avoidable errors in real-world applications. It demonstrates that popular metrics like precision, F1, and AUC are never optimal, causing a larger fraction of incorrect evaluations than even moderately mis-specified decision-theoretic metrics.

How can one meaningfully make a measurement, if the meter does not conform to any standard and its scale expands or shrinks depending on what is measured? In the present work it is argued that current evaluation practices for machine-learning classifiers are affected by this kind of problem, leading to negative consequences when classifiers are put to real use; consequences that could have been avoided. It is proposed that evaluation be grounded on Decision Theory, and the implications of such a foundation are explored. The main result is that every evaluation metric must be a linear combination of confusion-matrix elements, with coefficients ("utilities") that depend on the specific classification problem. For binary classification, the space of such possible metrics is effectively two-dimensional. It is shown that popular metrics such as precision, balanced accuracy, Matthews Correlation Coefficient, Fowlkes-Mallows index, F1-measure, and Area Under the Curve are never optimal: they always give rise to an in-principle avoidable fraction of incorrect evaluations. This fraction is even larger than would be caused by the use of a decision-theoretic metric with moderately wrong coefficients.
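
As a rough illustration of the main result, the sketch below scores two binary classifiers with a metric that is linear in the confusion-matrix elements and compares the ranking with the one given by F1. It is not taken from the paper. The function names (`utility_metric`, `f1_from_confusion`), the confusion-matrix layout, the two confusion matrices, and both utility matrices are hypothetical values chosen for illustration.

```python
import numpy as np

def utility_metric(confusion, utilities):
    """Average utility per classified item.

    Computes sum_ij utilities[i, j] * confusion[i, j] / N, i.e. a linear
    combination of the confusion-matrix elements.
    """
    confusion = np.asarray(confusion, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return float((utilities * confusion).sum() / confusion.sum())

def f1_from_confusion(confusion):
    """F1 score; not a linear function of the confusion-matrix elements."""
    (tp, fp), (fn, _tn) = confusion
    return 2 * tp / (2 * tp + fp + fn)

# Assumed layout: [[TP, FP], [FN, TN]] (rows = predicted, columns = true).
# Both hypothetical classifiers are tested on 50 positives and 50 negatives.
clf_a = [[40, 20], [10, 30]]
clf_b = [[30, 5], [20, 45]]

# Hypothetical utilities for two different problems, same layout as above.
u_fp_costly = [[1, -3], [-1, 1]]   # false positives are expensive
u_fn_costly = [[1, 0], [-5, 0]]    # false negatives are expensive

for name, u in [("FP costly", u_fp_costly), ("FN costly", u_fn_costly)]:
    print(name, utility_metric(clf_a, u), utility_metric(clf_b, u))
print("F1", f1_from_confusion(clf_a), f1_from_confusion(clf_b))

# A positive rescaling plus a constant shift of all four utilities maps
# every score s to 2*s + 3, so it cannot change which classifier ranks higher.
rescaled = 2.0 * np.asarray(u_fp_costly, dtype=float) + 3.0
print("rescaled", utility_metric(clf_a, rescaled), utility_metric(clf_b, rescaled))
```

With these made-up numbers, the utility-based metric prefers classifier B when false positives are costly and classifier A when false negatives are costly, whereas F1 prefers A regardless of the problem. The last step shows one reading of why the space of binary metrics is "effectively two-dimensional": of the four utilities, an overall positive scale and an additive constant do not affect the ranking of classifiers, which leaves two free parameters.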
