LGNov 20, 2023
Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lensJean Feng, Adarsh Subbaswamy, Alexej Gossmann et al.
After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.
LGJul 28, 2023
Is this model reliable for everyone? Testing for strong calibrationJean Feng, Alexej Gossmann, Romain Pirracchio et al.
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well-known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing into one of changepoint detection, for which powerful methods already exist. We begin with introducing a sample-splitting procedure where a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
MLMar 21, 2022
Sequential algorithmic modification with test data reuseJean Feng, Gene Pennello, Nicholas Petrick et al.
After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent works have shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce novel extensions that account for correlation between adaptively chosen algorithmic modifications. In empirical analyses, the SRGPs control the error rate of approving unacceptable modifications and approve a substantially higher number of beneficial modifications than previous approaches.
LGApr 6, 2023
Data AUDIT: Identifying Attribute Utility- and Detectability-Induced Bias in Task ModelsMitchell Pavlak, Nathan Drenkow, Nicholas Petrick et al.
To safely deploy deep learning-based computer vision models for computer-aided detection and diagnosis, we must ensure that they are robust and reliable. Towards that goal, algorithmic auditing has received substantial attention. To guide their audit procedures, existing methods rely on heuristic approaches or high-level objectives (e.g., non-discrimination in regards to protected attributes, such as sex, gender, or race). However, algorithms may show bias with respect to various attributes beyond the more obvious ones, and integrity issues related to these more subtle attributes can have serious consequences. To enable the generation of actionable, data-driven hypotheses which identify specific dataset attributes likely to induce model bias, we contribute a first technique for the rigorous, quantitative screening of medical image datasets. Drawing from literature in the causal inference and information theory domains, our procedure decomposes the risks associated with dataset attributes in terms of their detectability and utility (defined as the amount of information knowing the attribute gives about a task label). To demonstrate the effectiveness and sensitivity of our method, we develop a variety of datasets with synthetically inserted artifacts with different degrees of association to the target label that allow evaluation of inherited model biases via comparison of performance against true counterfactual examples. Using these datasets and results from hundreds of trained models, we show our screening method reliably identifies nearly imperceptible bias-inducing artifacts. Lastly, we apply our method to the natural attributes of a popular skin-lesion dataset and demonstrate its success. Our approach provides a means to perform more systematic algorithmic audits and guide future data collection efforts in pursuit of safer and more reliable models.
MLNov 17, 2022
Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventionsJean Feng, Alexej Gossmann, Gene Pennello et al.
Performance monitoring of machine learning (ML)-based risk prediction models in healthcare is complicated by the issue of confounding medical interventions (CMI): when an algorithm predicts a patient to be at high risk for an adverse event, clinicians are more likely to administer prophylactic treatment and alter the very target that the algorithm aims to predict. A simple approach is to ignore CMI and monitor only the untreated patients, whose outcomes remain unaltered. In general, ignoring CMI may inflate Type I error because (i) untreated patients disproportionally represent those with low predicted risk and (ii) evolution in both the model and clinician trust in the model can induce complex dependencies that violate standard assumptions. Nevertheless, we show that valid inference is still possible if one monitors conditional performance and if either conditional exchangeability or time-constant selection bias hold. Specifically, we develop a new score-based cumulative sum (CUSUM) monitoring procedure with dynamic control limits. Through simulations, we demonstrate the benefits of combining model updating with monitoring and investigate how over-trust in a prediction model may delay detection of performance deterioration. Finally, we illustrate how these monitoring methods can be used to detect calibration decay of an ML-based risk calculator for postoperative nausea and vomiting during the COVID-19 pandemic.
LGApr 16, 2024
TorchSurv: A Lightweight Package for Deep Survival AnalysisMélodie Monod, Peter Krusche, Qian Cao et al.
TorchSurv is a Python package that serves as a companion tool to perform deep survival modeling within the PyTorch environment. Unlike existing libraries that impose specific parametric forms, TorchSurv enables the use of custom PyTorch-based deep survival models. With its lightweight design, minimal input requirements, full PyTorch backend, and freedom from restrictive survival model parameterizations, TorchSurv facilitates efficient deep survival model implementation and is particularly beneficial for high-dimensional and complex input data scenarios.
LGMar 13, 2025
Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing FrameworkNathan Drenkow, Mitchell Pavlak, Keith Harrigian et al.
Artificial Intelligence (AI) is now firmly at the center of evidence-based medicine. Despite many success stories that edge the path of AI's rise in healthcare, there are comparably many reports of significant shortcomings and unexpected behavior of AI in deployment. A major reason for these limitations is AI's reliance on association-based learning, where non-representative machine learning datasets can amplify latent bias during training and/or hide it during testing. To unlock new tools capable of foreseeing and preventing such AI bias issues, we present G-AUDIT. Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets is a modality-agnostic dataset auditing framework that allows for generating targeted hypotheses about sources of bias in training or testing data. Our method examines the relationship between task-level annotations (commonly referred to as ``labels'') and data properties including patient attributes (e.g., age, sex) and environment/acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT quantifies the extent to which the observed data attributes pose a risk for shortcut learning, or in the case of testing data, might hide predictions made based on spurious associations. We demonstrate the broad applicability of our method by analyzing large-scale medical datasets for three distinct modalities and machine learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems.