LG, Feb 3, 2022

Hidden Heterogeneity: When to Choose Similarity-Based Calibration

arXiv:2202.01840v2 (1 citation)
AI Analysis

This work addresses the need for more reliable probability predictions in high-stakes decision making by identifying when local calibration methods are beneficial. It is incremental in that it builds on existing calibration techniques.

The paper tackles hidden heterogeneity in classifier calibration: subpopulations in which calibration could also improve prediction accuracy, but which global calibration methods cannot detect. It proposes a quantitative measure of hidden heterogeneity and two similarity-weighted calibration methods that adapt locally to each test item. Given sufficient calibration data, these methods generally achieve better calibration than global methods.

Trustworthy classifiers are essential to the adoption of machine learning predictions in many real-world settings. The predicted probability of possible outcomes can inform high-stakes decision making, particularly when assessing the expected value of alternative decisions or the risk of bad outcomes. These decisions require well-calibrated probabilities, not just the correct prediction of the most likely class. Black-box classifier calibration methods can improve the reliability of a classifier's output without requiring retraining. However, these methods are unable to detect subpopulations where calibration could also improve prediction accuracy. Such subpopulations are said to exhibit "hidden heterogeneity" (HH), because the original classifier did not detect them. This paper proposes a quantitative measure for HH. It also introduces two similarity-weighted calibration methods that can address HH by adapting locally to each test item: SWC weights the calibration set by similarity to the test item, and SWC-HH explicitly incorporates hidden heterogeneity to filter the calibration set. Experiments show that the improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and, given sufficient calibration data, generally exceed calibration achieved by global methods. HH can therefore serve as a useful diagnostic tool for identifying when local calibration methods would be beneficial.
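
The similarity-weighting idea behind SWC can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the similarity measure, so the RBF kernel, the `bandwidth` parameter, and the function name `swc_probabilities` are all assumptions made for illustration. The sketch estimates class probabilities for one test item as the similarity-weighted label frequencies of a held-out calibration set.

```python
import numpy as np


def swc_probabilities(x_test, X_cal, y_cal, n_classes, bandwidth=1.0):
    """Similarity-weighted class probabilities for a single test item.

    Hypothetical helper: an RBF kernel in feature space stands in for
    whatever similarity measure the paper actually uses.
    """
    X_cal = np.asarray(X_cal, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    y_cal = np.asarray(y_cal, dtype=int)

    # Squared Euclidean distance from the test item to each calibration item.
    sq_dist = np.sum((X_cal - x_test) ** 2, axis=1)

    # Assumed similarity: RBF kernel, so nearby calibration items weigh more.
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Similarity-weighted label counts per class.
    counts = np.bincount(y_cal, weights=weights, minlength=n_classes)

    total = counts.sum()
    if total <= 0.0:
        # All weights underflowed to zero: fall back to a uniform estimate.
        return np.full(n_classes, 1.0 / n_classes)
    return counts / total


# Toy calibration set with two well-separated groups and different labels.
X_cal = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
y_cal = np.array([0, 0, 1, 1])

p_near_first_group = swc_probabilities([0.0, 0.0], X_cal, y_cal, n_classes=2)
p_near_second_group = swc_probabilities([5.0, 5.0], X_cal, y_cal, n_classes=2)
```

In this toy set, a global estimate that ignored similarity would assign each class a frequency of 0.5 to every test item, whereas the weighted estimate follows the labels of the nearby group. That contrast is the kind of local adaptation the abstract attributes to similarity-based calibration.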

Code Implementations: 1 repo