LG · AI · Mar 24

Confidence Calibration under Ambiguous Ground Truth

Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu

arXiv:2603.22879 · 60.4 · h-index: 1

Predicted impact: top 36% in LG · last 90 days · Originality: Highly original

AI Analysis

This work addresses confidence calibration in domains where labels are ambiguous, such as medical imaging and natural language inference, and proposes post-hoc calibrators that improve reliability without model retraining.

The paper tackles the problem of confidence calibration when ground truth is ambiguous due to annotator disagreement, showing that standard calibrators are structurally biased and miscalibrated. It introduces ambiguity-aware calibrators that reduce true-label Expected Calibration Error (ECE) by 9-87% across benchmarks without requiring model retraining.
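To make the distinction between voted-label and true-label ECE concrete, here is a minimal sketch in plain Python. It assumes one plausible reading of "true-label ECE": the correctness of a prediction is the annotator probability of the predicted class rather than agreement with the majority vote. The function names, the equal-width binning, and that reading are assumptions for illustration, not the paper's definitions.

```python
def ece(confidences, correctness, n_bins=10):
    """Binned ECE: sum over bins of bin weight * |mean correctness - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, corr in zip(confidences, correctness):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, corr))
    n = len(confidences)
    total = 0.0
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            mean_corr = sum(a for _, a in items) / len(items)
            total += len(items) / n * abs(mean_corr - mean_conf)
    return total


def voted_and_true_label_ece(probs, annotator_dists, n_bins=10):
    """Return (ECE against majority-voted labels, ECE against annotator distribution).

    Assumption: "true-label" correctness is the annotator probability of the
    predicted class; the paper may define it differently.
    """
    confs, hard, soft = [], [], []
    for p, q in zip(probs, annotator_dists):
        pred = max(range(len(p)), key=p.__getitem__)   # model's predicted class
        voted = max(range(len(q)), key=q.__getitem__)  # majority-voted label
        confs.append(p[pred])
        hard.append(1.0 if pred == voted else 0.0)
        soft.append(q[pred])
    return ece(confs, hard, n_bins), ece(confs, soft, n_bins)
```

For example, a model that predicts class 0 with confidence 0.9 on inputs where annotators split 60/40 looks close to calibrated against the voted label, but is overconfident by a wider margin against the annotator distribution.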

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
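The bias described in the abstract can be illustrated with a toy sketch that fits a single temperature against three kinds of targets: majority-voted one-hot labels, the full annotator distribution, and one sampled annotation per example (a loose reading of the S=1 setting). This is not the authors' implementation. The grid search, the helper names, and the toy logits are all assumptions, and Dirichlet-Soft and LS-TS are not reproduced here.

```python
import math
import random


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def cross_entropy(logits_batch, targets_batch, temperature):
    """Mean cross-entropy between target distributions and the tempered softmax."""
    loss = 0.0
    for logits, target in zip(logits_batch, targets_batch):
        log_probs = log_softmax(logits, temperature)
        loss -= sum(t * lp for t, lp in zip(target, log_probs))
    return loss / len(logits_batch)


def fit_temperature(logits_batch, targets_batch, grid=None):
    """Pick the temperature on a grid that minimises cross-entropy (illustrative only)."""
    if grid is None:
        grid = [0.05 * k for k in range(2, 201)]  # 0.10 up to 10.0
    return min(grid, key=lambda t: cross_entropy(logits_batch, targets_batch, t))


def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]


def majority_targets(dists):
    """One-hot targets from the majority vote of each annotator distribution."""
    return [one_hot(max(range(len(q)), key=q.__getitem__), len(q)) for q in dists]


def sampled_targets(dists, rng):
    """One annotation per example, drawn from its annotator distribution."""
    return [one_hot(rng.choices(range(len(q)), weights=q)[0], len(q)) for q in dists]


# Toy data (hypothetical): identical logits, annotators split 60/40 on every input.
logits = [[2.0, 0.0]] * 500
dists = [[0.6, 0.4]] * 500
t_voted = fit_temperature(logits, majority_targets(dists))
t_soft = fit_temperature(logits, dists)
t_sampled = fit_temperature(logits, sampled_targets(dists, random.Random(0)))
```

In this toy setup the voted-label fit drives the temperature to the bottom of the grid, because every one-hot target agrees with the prediction. The soft-target fit lands near 2 / ln(1.5) ≈ 4.93, the temperature at which the softmax matches the 60/40 split. The sampled-annotation fit depends on the random draw, but it optimises the same objective in expectation, which is the intuition behind using a single annotation per example.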
