Calibration through the Lens of Interpretability
This work addresses the need among researchers and practitioners for interpretable and verifiable calibration metrics, though its contribution is incremental, since it builds on existing calibration concepts.
The paper tackles the problem of defining and evaluating calibration in machine learning models. It proposes an axiomatic framework that catalogs desirable properties of calibrated models and analyzes their feasibility and correspondences. Its empirical results show that a simple decision tree can achieve competitive calibration performance.
Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy. A calibrated model is a function whose values correctly reflect underlying label probabilities. Calibration in itself, however, implies neither classification accuracy nor human-interpretable estimates, and it is not straightforward to verify calibration from finite data. There is a plethora of evaluation metrics (and loss functions), each of which assesses a specific aspect of a calibrated model. In this work, we initiate an axiomatic study of the notion of calibration. We catalogue desirable properties of calibrated models as well as corresponding evaluation metrics, and we analyze their feasibility and correspondences. We complement this analysis with an empirical evaluation that compares common calibration methods to a simple, interpretable decision tree.
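
To make the notion of a calibration metric concrete, the following is a minimal sketch of one widely used metric, the binned expected calibration error (ECE), for binary labels. It is an illustration only and is not taken from the paper: the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are all assumptions made here.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for binary labels (hypothetical helper, not from the paper).

    Predictions are grouped into equal-width probability bins. For each bin,
    the gap between the mean predicted probability and the observed fraction
    of positive labels is weighted by the share of samples in that bin.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("need at least one prediction")

    prob_sums = [0.0] * n_bins  # sum of predicted probabilities per bin
    positives = [0] * n_bins    # number of positive labels per bin
    counts = [0] * n_bins       # number of samples per bin

    for p, y in zip(probs, labels):
        # A prediction of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        prob_sums[b] += p
        positives[b] += y
        counts[b] += 1

    n = len(probs)
    ece = 0.0
    for s, h, c in zip(prob_sums, positives, counts):
        if c:  # empty bins contribute nothing
            ece += (c / n) * abs(s / c - h / c)
    return ece


# A constant 0.5 predictor on balanced labels has zero binned ECE,
# yet it classifies no better than chance.
balanced = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Confident predictions that are always wrong give a large ECE.
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])
```

The first usage example mirrors the point made in the abstract that calibration does not imply classification accuracy. The dependence of the result on the chosen number of bins is one simple reason why a metric of this kind only assesses a specific aspect of calibration and is hard to verify from finite data.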