Confound-leakage: Confound Removal in Machine Learning Leads to Leakage
Naive confound removal is a critical flaw in standard ML workflows, particularly in epidemiology and medicine, where it can produce misleading results that affect model deployment and interpretation.
The study found that a common method of removing confounds in machine learning, regressing them out of each feature before applying ML, can bias models and artificially inflate prediction accuracy. In a clinical dataset, this led to overestimated accuracy for predicting ADHD with depression as a confound.
Machine learning (ML) approaches to data analysis are now widely adopted in many fields, including epidemiology and medicine. To apply these approaches, confounds must first be removed, which is commonly done by featurewise removal of their variance via linear regression before applying ML. Here, we show that this common approach to confound removal biases ML models, leading to misleading results. Specifically, it can leak information such that null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset in which accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for the implementation and deployment of ML workflows and call for caution against naïve use of standard confound removal approaches.
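The featurewise confound regression described above, followed by a nonlinear learner, can be sketched on simulated data. This is a toy illustration under stated assumptions, not the authors' code or data: the simulated variables, the helper names (`fit_confound_regression`, `remove_confound`, `run_demo`), the single binary noise feature, and the choice of a shallow decision tree are all hypothetical. The feature carries no information about the target, while the confound does. After the confound is regressed out, each residual is the original binary value shifted by a small multiple of the confound, so a tree may be able to read the confound back out of the "cleaned" feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_confound_regression(X, c):
    """Fit one linear model per feature: X[:, j] ~ intercept + slope * c."""
    design = np.column_stack([np.ones_like(c), c])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return coef  # shape (2, n_features)


def remove_confound(X, c, coef):
    """Subtract the fitted confound contribution from every feature."""
    design = np.column_stack([np.ones_like(c), c])
    return X - design @ coef


def run_demo(seed=0, n=2000):
    rng = np.random.default_rng(seed)

    # Assumed toy data: the confound c drives the target y,
    # while the single binary feature X is pure noise.
    c = rng.normal(size=n)
    y = (c + 0.5 * rng.normal(size=n) > 0).astype(int)
    X = rng.integers(0, 2, size=(n, 1)).astype(float)

    n_train = int(0.75 * n)
    tr, te = slice(0, n_train), slice(n_train, n)

    # Baseline: the raw feature should predict y at about chance level.
    raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    acc_raw = raw_tree.fit(X[tr], y[tr]).score(X[te], y[te])

    # Confound removal fitted on the training split only,
    # then applied unchanged to the test split.
    coef = fit_confound_regression(X[tr], c[tr])
    X_tr_res = remove_confound(X[tr], c[tr], coef)
    X_te_res = remove_confound(X[te], c[te], coef)

    res_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    acc_after = res_tree.fit(X_tr_res, y[tr]).score(X_te_res, y[te])

    # Linear check: training residuals are uncorrelated with the confound.
    corr = np.corrcoef(X_tr_res[:, 0], c[tr])[0, 1]

    return {
        "acc_raw": acc_raw,
        "acc_after_removal": acc_after,
        "train_residual_corr": corr,
    }


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the regression coefficients are estimated on the training split only, so any gain in test accuracy after removal cannot be attributed to the test data influencing the fit. The residuals are linearly uncorrelated with the confound, yet the tree is free to exploit the nonlinear structure that remains. Comparing accuracy with and without confound removal, as `run_demo` does, is one simple way to notice when removal increases rather than decreases predictive performance.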