Forecast Evaluation and the Relationship of Regret and Calibration
This work addresses the need for standardized evaluation metrics in machine learning forecasting. It offers a foundational framework that clarifies how existing metrics relate and could therefore be relevant across ML/AI, though it is incremental in that it builds on existing concepts.
The paper tackles the problem of evaluating machine learning forecasts by proposing a general framework that unifies various evaluation metrics, such as regret and calibration scores, under a two-dimensional hierarchy based on a fairness criterion. It shows that while regret-type and calibration-type metrics are theoretically equivalent in their evaluation ability, their scores are practically incomparable.
Machine learning is about forecasting. Forecasts become useful when they come with an evaluation metric. What are reasonable evaluation metrics? How do existing evaluation metrics relate? In this work, we provide a general structure which subsumes many currently used evaluation metrics, e.g., external and swap regret, loss scores, and calibration scores, in a two-dimensional hierarchy. The framework embeds those evaluation metrics in a large set of single-instance-based comparisons of forecasts and observations which respect a meta-criterion for reasonable forecast evaluations, which we term "fairness". In particular, this framework sheds light on the relationship between regret-type and calibration-type evaluation metrics, showing a theoretical equivalence in their ability to evaluate but a practical incomparability of the obtained scores.
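To make the metric families named above concrete, the following is a minimal sketch, not the paper's framework. It assumes binary outcomes, the Brier (squared) loss, and textbook definitions of external regret, swap regret, and an L1 calibration error; all function names are hypothetical. In this special case, swap regret and calibration error are both zero exactly when every forecast value matches the empirical outcome frequency on the rounds where it was issued, yet the two scores live on different scales.

```python
from collections import defaultdict


def brier(p, y):
    """Squared (Brier) loss of forecast p for binary outcome y."""
    return (p - y) ** 2


def _group_by_forecast(forecasts, outcomes):
    """Collect the outcomes observed on the rounds where each forecast value was issued."""
    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)
    return groups


def external_regret(forecasts, outcomes):
    """Cumulative loss minus the loss of the best constant forecast in hindsight."""
    total = sum(brier(p, y) for p, y in zip(forecasts, outcomes))
    best_const = sum(outcomes) / len(outcomes)  # the mean minimizes squared loss
    return total - sum(brier(best_const, y) for y in outcomes)


def swap_regret(forecasts, outcomes):
    """Cumulative loss minus the loss after optimally replacing each distinct forecast value."""
    regret = 0.0
    for p, ys in _group_by_forecast(forecasts, outcomes).items():
        best = sum(ys) / len(ys)  # best replacement for value p under squared loss
        regret += sum(brier(p, y) - brier(best, y) for y in ys)
    return regret


def l1_calibration_error(forecasts, outcomes):
    """Frequency-weighted average of |p - empirical outcome frequency given p|."""
    n = len(outcomes)
    groups = _group_by_forecast(forecasts, outcomes)
    return sum(len(ys) * abs(p - sum(ys) / len(ys)) for p, ys in groups.items()) / n


# Illustrative data (made up): the forecaster is overconfident in both directions.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
outcomes = [1, 1, 0, 0, 0, 0]

print("external regret:     ", external_regret(forecasts, outcomes))
print("swap regret:         ", swap_regret(forecasts, outcomes))
print("L1 calibration error:", l1_calibration_error(forecasts, outcomes))
```

Swap regret is never smaller than external regret here, because replacing every forecast by the same constant is one particular swap. The calibration error is normalized by the number of rounds while the regrets are cumulative, which is one simple reason the raw numbers cannot be compared directly.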