Domain constraints improve risk prediction when outcome data is missing
This work addresses risk prediction in healthcare settings where outcomes are observed only for patients whom human decision-makers chose to test. It offers a method for estimating risk more accurately for both tested and untested patients. The contribution is incremental: it extends existing Bayesian models with new domain constraints.
The paper tackles outcome prediction when historical decision-making determines which outcomes are observed, as in medical testing. It proposes a Bayesian model with two domain constraints: a prevalence constraint and an expertise constraint. The authors show, theoretically and on synthetic data, that these constraints improve parameter inference. In a cancer risk case study, the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it identifies suboptimalities in test allocation.
Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
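The selective-observation setting and the idea behind the prevalence constraint can be illustrated with a small simulation. The sketch below is a toy and is not the paper's Bayesian model: the data-generating coefficients, the binned risk estimate, and the logit-shift correction are all assumptions chosen for illustration, and the "known" prevalence is simply read off the simulated ground truth. It does not illustrate the expertise constraint.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def logit(p):
    return math.log(p / (1.0 - p))


def x_bin(x):
    # Coarse integer bins over the observed feature, clipped to [-3, 2].
    return min(max(math.floor(x), -3), 2)


def simulate(n, seed):
    # Hypothetical data-generating process (all coefficients are assumptions):
    # x is observed, z is unobserved, and both raise risk and testing odds.
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        y = rng.random() < sigmoid(-2.0 + x + z)             # true outcome
        tested = rng.random() < sigmoid(-1.0 + x + 2.0 * z)  # testing decision
        patients.append((x, tested, y))
    return patients


patients = simulate(n=20000, seed=0)

# Ground truth, available only because this is a simulation.
true_prevalence = sum(y for _, _, y in patients) / len(patients)

# Outcomes are observed only for tested patients.
tested_outcomes = [y for _, t, y in patients if t]
tested_positive_rate = sum(tested_outcomes) / len(tested_outcomes)

# Naive risk model: positive rate among tested patients in each x bin,
# with add-one smoothing so every estimate lies strictly between 0 and 1.
bins = range(-3, 3)
pop_count = {b: 0 for b in bins}
test_count = {b: 0 for b in bins}
test_pos = {b: 0 for b in bins}
for x, t, y in patients:
    b = x_bin(x)
    pop_count[b] += 1
    if t:
        test_count[b] += 1
        test_pos[b] += y
naive_risk = {b: (test_pos[b] + 1) / (test_count[b] + 2) for b in bins}


def population_mean_risk(delta):
    # Mean predicted risk over ALL patients after shifting every bin's
    # naive estimate by delta on the logit scale.
    total = sum(pop_count[b] * sigmoid(logit(naive_risk[b]) + delta) for b in bins)
    return total / len(patients)


# Toy prevalence constraint: bisect for the shift that makes the
# population-average predicted risk equal the known prevalence.
lo, hi = -20.0, 20.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    if population_mean_risk(mid) > true_prevalence:
        hi = mid
    else:
        lo = mid
delta = (lo + hi) / 2.0

print("positive rate among tested:", round(tested_positive_rate, 3))
print("true prevalence:", round(true_prevalence, 3))
print("naive mean predicted risk:", round(population_mean_risk(0.0), 3))
print("logit shift from prevalence constraint:", round(delta, 3))
```

Because testing in this simulation depends on the unobserved factor, a model fit only on tested patients overstates risk for the population as a whole. Pinning the population-average risk to a known prevalence removes that average bias, although a single shift cannot recover how the unobserved factor varies across patients, which is why the paper's model treats the problem more fully.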