ML · LG · Oct 29, 2025

Monitoring the calibration of probability forecasts with an application to concept drift detection involving image classification

arXiv:2510.25573v1 · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the need for continuous calibration monitoring of deployed machine learning models, especially image classifiers used in industry and defense. It is an incremental advance, building on existing calibration assessment methods.

The paper tackles the problem of monitoring probability forecasts over time for loss of calibration, particularly for image classification models. It proposes a cumulative sum (CUSUM) approach with dynamic control limits that enables early detection of miscalibration in both process monitoring and concept drift applications.
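For orientation, here is one standard way to formalize calibration and a cumulative sum chart with dynamic limits for forecasts $\hat p_t$ and binary outcomes $y_t$. This is an assumption for illustration; the paper's exact statistic may differ. The first line is the calibration condition; the second accumulates a log-likelihood ratio against an assumed alternative in which the true odds are $R$ times the forecast odds; the third picks each limit $h_t$ so that the probability of a first false alarm at step $t$ equals a target $\alpha$.

```latex
\Pr(Y_t = 1 \mid \hat p_t = p) = p \quad \text{for all } p \in [0,1]

W_t = y_t \log R - \log\bigl(1 - \hat p_t + R\,\hat p_t\bigr), \qquad
S_t = \max\bigl(0,\; S_{t-1} + W_t\bigr), \qquad S_0 = 0

\Pr\bigl(S_t > h_t \mid S_1 \le h_1, \dots, S_{t-1} \le h_{t-1}\bigr) = \alpha
```

Because each $h_t$ is conditioned on the forecasts actually observed, the limit adapts as the mix of predicted probabilities changes, rather than being fixed in advance.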

Machine learning approaches for image classification have led to impressive advances in that field. For example, convolutional neural networks achieve remarkable image classification accuracy across a wide range of applications in industry, defense, and other areas. While these models boast impressive accuracy, a related concern is how to assess and maintain calibration in the predictions they make. A classification model is said to be well calibrated if its predicted probabilities correspond with the rates at which events actually occur. While there are many available methods to assess machine learning calibration and recalibrate faulty predictions, less effort has been spent on developing approaches that continually monitor predictive models for potential loss of calibration as time passes. We propose a cumulative sum-based approach with dynamic limits that enables detection of miscalibration in both traditional process monitoring and concept drift applications. This enables early detection of operational context changes that impact image classification performance in the field. The proposed chart can be used broadly in any situation where the user needs to monitor probability predictions over time for potential lapses in calibration. Importantly, our method operates on probability predictions and event outcomes and does not require under-the-hood access to the machine learning model.
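The abstract does not spell out the chart's statistic, so what follows is only a sketch under stated assumptions, not the authors' implementation. It assumes one binary event per forecast (for a multiclass image classifier, one hypothetical reduction is to pair the top-class confidence with whether the top-class prediction was correct). It uses a risk-adjusted Bernoulli CUSUM whose alternative multiplies the forecast odds by an assumed `odds_ratio`, and it builds dynamic limits by simulating in-control outcomes from the observed forecasts and taking the (1 − `alpha`) quantile of the surviving simulated paths at each step. The names `cusum_weight` and `monitor_calibration` and all default values are hypothetical. Like the method described above, it needs only forecasts and outcomes, not the model.

```python
import numpy as np


def cusum_weight(p, y, odds_ratio):
    """Log-likelihood-ratio weight for forecast p and outcome y (0 or 1).

    H0: P(y = 1) = p, i.e. the forecast is calibrated.
    H1 (assumed alternative): true odds = odds_ratio * forecast odds.
    """
    return y * np.log(odds_ratio) - np.log(1.0 - p + odds_ratio * p)


def monitor_calibration(probs, outcomes, odds_ratio=2.0, alpha=0.005,
                        n_sim=5000, seed=0):
    """One-sided CUSUM on probability forecasts with simulated dynamic limits.

    Returns (statistics, limits, index of first alarm or None).
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)

    s_obs = 0.0
    s_sim = np.zeros(n_sim)  # simulated in-control CUSUM paths
    stats, limits = [], []
    alarm = None

    for t, (p, y) in enumerate(zip(probs, outcomes)):
        # Update the chart on the observed forecast/outcome pair.
        s_obs = max(0.0, s_obs + cusum_weight(p, y, odds_ratio))

        # Simulate outcomes as if forecast p were perfectly calibrated.
        y_sim = rng.random(n_sim) < p
        s_sim = np.maximum(0.0, s_sim + cusum_weight(p, y_sim, odds_ratio))

        # Dynamic limit: (1 - alpha) quantile of the surviving in-control paths.
        h_t = np.quantile(s_sim, 1.0 - alpha)
        stats.append(s_obs)
        limits.append(h_t)

        if alarm is None and s_obs > h_t:
            alarm = t

        # Replace simulated paths that crossed the limit with copies of
        # survivors, so the next limit is conditional on no earlier alarm.
        crossed = s_sim > h_t
        if crossed.any():
            survivors = np.flatnonzero(~crossed)
            s_sim[crossed] = s_sim[rng.choice(survivors, crossed.sum())]

    return np.array(stats), np.array(limits), alarm
```

Running the function twice, once with `odds_ratio` above 1 (events occur more often than forecast) and once below 1 (less often), gives a two-sided check. Because the statistic is discrete and the limits are estimated by simulation, the per-step false-alarm rate is only approximately `alpha`.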
