LG, ML · Sep 6, 2019

Master your Metrics with Calibration

arXiv:1909.02827v2 · 44 citations
AI Analysis

This paper addresses the challenge of reliable performance evaluation in real-world applications such as model monitoring and fairness evaluation, although the contribution is incremental, building on existing metric calibration concepts.

Specifically, it tackles the problem of interpreting a machine learning model's performance across subpopulations or time periods. It proposes calibrating precision-based metrics such as the F1-score so that they become invariant to the class prior, and shows in experiments on balanced and imbalanced data that calibration improves interpretability and gives better control over what is measured.
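Why precision-based metrics depend on the class prior can be made explicit with a short derivation. This is a sketch, and the notation is assumed here rather than copied from the paper: π is the fraction of positives, TPR the true positive rate (recall) and FPR the false positive rate. TPR and FPR do not depend on π, but precision does. Replacing π with a fixed reference prior π_0 gives a calibrated precision, and plugging it into the usual F1 formula gives a calibrated F1.

```latex
\begin{aligned}
\mathrm{Prec} &= \frac{TP}{TP + FP}
  = \frac{\pi\,\mathrm{TPR}}{\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}}
  = \frac{\mathrm{TPR}}{\mathrm{TPR} + \frac{1-\pi}{\pi}\,\mathrm{FPR}} \\
\mathrm{Prec}_c &= \frac{\mathrm{TPR}}{\mathrm{TPR} + \frac{1-\pi_0}{\pi_0}\,\mathrm{FPR}},
\qquad
F_{1,c} = \frac{2\,\mathrm{Prec}_c\,\mathrm{TPR}}{\mathrm{Prec}_c + \mathrm{TPR}}
\end{aligned}
```

When π_0 = π, the calibrated precision reduces to ordinary precision. As long as TPR and FPR are unchanged, Prec_c stays constant when only the class balance of the evaluation data changes.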

Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Precision-Recall Curve). Heavily dependent on the class prior, such metrics make it difficult to interpret the variation of a model's performance over different subpopulations/subperiods in a dataset. In this paper, we propose a way to calibrate the metrics so that they can be made invariant to the prior. We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use-cases where calibration is beneficial, such as model monitoring in production, reporting, or fairness evaluation.
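A minimal NumPy sketch of prior-invariant metrics, assuming calibration evaluates precision at a fixed reference prior pi0 via TPR / (TPR + (1 - pi0) / pi0 * FPR), where TPR and FPR are computed on the actual data. For AUC-PR it uses average precision (a step-wise estimate of the area) with the calibrated precision at each threshold, and assumes distinct scores. The names `prior_metrics`, `calibrated_average_precision` and `make_split` are hypothetical illustrations, not the paper's code.

```python
import numpy as np


def prior_metrics(y_true, y_pred, pi0=0.5):
    """Raw and calibrated precision/F1 for binary labels and hard predictions.

    Hypothetical helper. Assumes both classes are present and at least one
    positive prediction, so no division by zero occurs.
    """
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie strictly between 0 and 1")
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    tpr = tp / np.sum(y_true)      # recall: does not depend on the class prior
    fpr = fp / np.sum(~y_true)     # false positive rate: prior-independent too
    precision = tp / (tp + fp)     # depends on the empirical prior
    # Precision re-expressed as if the fraction of positives were pi0.
    precision_c = tpr / (tpr + (1 - pi0) / pi0 * fpr)
    f1 = 2 * precision * tpr / (precision + tpr)
    f1_c = 2 * precision_c * tpr / (precision_c + tpr)
    return {"recall": float(tpr), "precision": float(precision),
            "precision_c": float(precision_c), "f1": float(f1), "f1_c": float(f1_c)}


def calibrated_average_precision(y_true, scores, pi0=0.5):
    """Average precision using calibrated precision at every threshold.

    Hypothetical helper; assumes distinct scores (ties are not grouped).
    """
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie strictly between 0 and 1")
    y_true = np.asarray(y_true, dtype=bool)
    # Sort samples by decreasing score; each position acts as a threshold.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y_sorted = y_true[order]
    tpr = np.cumsum(y_sorted) / y_true.sum()
    fpr = np.cumsum(~y_sorted) / (~y_true).sum()
    precision_c = tpr / (tpr + (1 - pi0) / pi0 * fpr)
    recall_steps = np.diff(tpr, prepend=0.0)   # non-zero only at positives
    return float(np.sum(recall_steps * precision_c))


def make_split(n_pos, tp, n_neg, fp):
    """Hypothetical toy data: labels and predictions with given TP and FP counts."""
    y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    y_pred = np.concatenate([np.ones(tp, dtype=int), np.zeros(n_pos - tp, dtype=int),
                             np.ones(fp, dtype=int), np.zeros(n_neg - fp, dtype=int)])
    return y_true, y_pred


if __name__ == "__main__":
    # Same classifier behaviour (TPR = 0.8, FPR = 0.1) under two class priors.
    splits = {"imbalanced (pi = 0.1)": make_split(100, 80, 900, 90),
              "balanced (pi = 0.5)": make_split(100, 80, 100, 10)}
    for name, (y, p) in splits.items():
        metrics = prior_metrics(y, p, pi0=0.5)
        print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With TPR = 0.8 and FPR = 0.1 on both splits, raw precision drops from 80/90 ≈ 0.89 on the balanced split to 80/170 ≈ 0.47 on the imbalanced one, while calibrated precision at pi0 = 0.5 stays at 0.8 / 0.9 ≈ 0.89 on both. The choice pi0 = 0.5 is only a convention for this example; any fixed reference prior makes the metric comparable across splits.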
