ML · LG · Feb 15, 2024

Negative impact of heavy-tailed uncertainty and error distributions on the reliability of calibration statistics for machine learning regression tasks

arXiv:2402.10043v5 · 3 citations · h-index: 3

Originality: Synthesis-oriented

AI Analysis

This paper addresses reliability issues in uncertainty quantification that matter to ML practitioners, but the contribution is incremental, as it focuses on improving existing statistical methods.

The study found that heavy-tailed uncertainty and error distributions in machine learning regression tasks make common calibration statistics such as the calibration error (CE) unreliable. The mean squared z-scores (ZMS) statistic is more robust, though it still requires caution with heavy-tailed data.

Average calibration of the (variance-based) prediction uncertainties of machine learning regression tasks can be tested in two ways: one is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV); the alternative is to compare the mean squared z-scores (ZMS) to 1. The problem is that the two approaches might lead to different conclusions, as illustrated in this study for an ensemble of datasets from the recent machine learning uncertainty quantification (ML-UQ) literature. It is shown that the estimation of MV, MSE and their confidence intervals becomes unreliable for heavy-tailed uncertainty and error distributions, which seem to be a frequent feature of ML-UQ datasets. By contrast, the ZMS statistic is less sensitive and offers the most reliable approach in this context, while acknowledging that datasets with heavy-tailed z-score distributions should be considered with great care. Unfortunately, the same problem is expected to also affect conditional calibration statistics, such as the popular ENCE, and very likely post-hoc calibration methods based on similar statistics. Several solutions to circumvent the outlined problems are proposed.
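
The two tests described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `calibration_stats`, the sign convention CE = MSE − MV (taken literally from the abstract, without square roots), and the synthetic inverse-gamma variance model used in the demo are all assumptions made here for illustration.

```python
import numpy as np


def calibration_stats(errors, uncertainties):
    """Return MSE, MV, CE and ZMS for paired errors and standard uncertainties.

    Hypothetical helper: CE is taken as MSE - MV, following the wording of
    the abstract; other sign or square-root conventions are possible.
    """
    e = np.asarray(errors, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    if e.shape != u.shape:
        raise ValueError("errors and uncertainties must have the same shape")
    if np.any(u <= 0):
        raise ValueError("uncertainties must be strictly positive")

    mse = float(np.mean(e ** 2))        # mean squared error
    mv = float(np.mean(u ** 2))         # mean (predicted) variance
    zms = float(np.mean((e / u) ** 2))  # mean squared z-score, target value 1
    return {"MSE": mse, "MV": mv, "CE": mse - mv, "ZMS": zms}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    # Assumed synthetic setup: heavy-tailed variances (inverse-gamma), with
    # errors drawn as u * N(0, 1), so the data are calibrated by construction.
    variances = 1.0 / rng.gamma(shape=1.5, scale=1.0, size=n)
    u = np.sqrt(variances)
    e = u * rng.standard_normal(n)
    print(calibration_stats(e, u))
```

Re-running the demo with different seeds gives a rough feel for how much the MSE, MV and CE estimates fluctuate compared with ZMS when the variance distribution is heavy-tailed, which is the behaviour the abstract warns about.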

