ML · LG · Jul 26, 2025

Bag of Coins: A Statistical Probe into Neural Confidence Structures

arXiv:2507.19774v1 · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses the reliability of confidence scores in high-stakes AI applications by giving practitioners a diagnostic tool. The advance is incremental, since it builds on existing calibration methods.

The paper tackles poorly calibrated confidence scores in neural networks by introducing the Bag-of-Coins (BoC) test, a statistical probe that reframes confidence estimation as a hypothesis test on the internal consistency of a classifier's logits. On Vision Transformers, the BoC output achieves near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. On CNNs such as ResNet, the probe instead reveals an inconsistency between the model's predictions and its internal logit structure.
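For readers unfamiliar with the metric quoted above, the following is a minimal sketch of Expected Calibration Error (ECE) using the common equal-width binning formulation. The function name, the 15-bin default, and the toy inputs are assumptions for illustration; the page does not state which binning scheme the paper used.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    ASSUMPTION: 15 equal-width bins on [0, 1]; the paper's exact
    binning scheme is not given on this page.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; digitize with right=True assigns a confidence c
    # to bin i when edges[i] < c <= edges[i + 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return float(ece)


# Toy data (hypothetical): four predictions at 0.9 confidence, three correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means stated confidence tracks observed accuracy more closely; a perfectly calibrated model scores 0.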

Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce a novel, non-parametric statistical probe, the Bag-of-Coins (BoC) test, that examines the internal consistency of a classifier's logits. The BoC test reframes confidence estimation as a frequentist hypothesis test: does the model's top-ranked class win 1-v-1 contests against random competitors at a rate consistent with its own stated softmax probability? When applied to modern deep learning architectures, this simple probe reveals a fundamental dichotomy. On Vision Transformers (ViTs), the BoC output serves as a state-of-the-art confidence score, achieving near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. Conversely, on Convolutional Neural Networks (CNNs) like ResNet, the probe reveals a deep inconsistency between the model's predictions and its internal logit structure, a property missed by traditional metrics. We posit that BoC is not merely a calibration method, but a new diagnostic tool for understanding and exposing the differing ways that popular architectures represent uncertainty.
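The abstract describes the BoC test only in words, so the following is a rough sketch of one way such a probe could be built, not the authors' implementation. It assumes that a 1-v-1 contest between the top class and a competitor is a coin flip whose win probability is the two-class softmax of their logits, that competitors are drawn uniformly from the other classes, and that consistency is judged with an exact two-sided binomial test against the stated softmax probability. The function names, the number of contests, and the example logits are all hypothetical.

```python
import math

import numpy as np


def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def binom_two_sided_pvalue(wins, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes no more likely than the observed one (fine for n up to a
    few hundred; larger n would need a log-space implementation)."""
    pmf = np.array(
        [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    )
    return float(min(1.0, pmf[pmf <= pmf[wins] * (1.0 + 1e-9)].sum()))


def bag_of_coins_probe(logits, n_contests=200, seed=0):
    """Sketch of a BoC-style consistency check for one logit vector."""
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=float)
    stated = float(softmax(z)[np.argmax(z)])  # model's stated confidence
    top = int(np.argmax(z))

    # ASSUMPTION: competitors are sampled uniformly from non-top classes.
    rivals = np.delete(np.arange(z.size), top)
    opponents = rng.choice(rivals, size=n_contests)

    # ASSUMPTION: each contest is a coin with bias equal to the two-class
    # softmax, i.e. the sigmoid of the logit gap.
    win_prob = 1.0 / (1.0 + np.exp(-(z[top] - z[opponents])))
    wins = int((rng.random(n_contests) < win_prob).sum())

    return {
        "stated_confidence": stated,
        "win_rate": wins / n_contests,
        "p_value": binom_two_sided_pvalue(wins, n_contests, stated),
    }


# Hypothetical logits for a four-class problem.
result = bag_of_coins_probe([4.0, 1.0, 0.5, 0.0])
```

Under these assumptions, a small p-value flags a prediction whose pairwise win rate disagrees with its stated softmax confidence. How the paper turns the test outcome into a confidence score is not specified here, so that step is left out.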

