Which Leakage Types Matter?
This work identifies which types of data leakage most impact reported performance in ML, challenging textbook emphasis and highlighting selection leakage as the primary concern for practitioners.
The study measured the severity of four data leakage classes in machine learning across 2,047 tabular datasets. Normalization leakage is negligible (|ΔAUC| ≤ 0.005). Selection leakage, such as peeking and seed cherry-picking, is substantial: about 90% of its measured effect is noise exploitation that inflates reported scores. Memorization scales with model capacity, from d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree).
Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce $|\Delta\text{AUC}| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
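To make the within-subject comparison concrete, the following minimal sketch simulates seed cherry-picking on synthetic scores and summarizes the paired differences with an effect size. It assumes that d_z denotes the standard paired effect size (mean of the per-dataset differences divided by their standard deviation); the abstract does not define it. The function names, the noise level, and the number of seeds are hypothetical choices for illustration, and the simulated scores are not the study's data.

```python
import random
import statistics


def cohens_dz(leaky, clean):
    """Paired effect size: mean per-dataset difference over its sample SD.

    Assumption: this is the usual d_z for within-subject designs.
    """
    diffs = [l - c for l, c in zip(leaky, clean)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def simulate_seed_cherry_picking(n_datasets=200, n_seeds=10,
                                 noise_sd=0.02, rng_seed=0):
    """Return (honest, picked) score lists, one entry per synthetic dataset.

    All parameters are hypothetical; scores are a synthetic 'true' value
    plus Gaussian seed noise, not real AUCs.
    """
    rng = random.Random(rng_seed)
    honest, picked = [], []
    for _ in range(n_datasets):
        true_score = rng.uniform(0.6, 0.9)
        scores = [true_score + rng.gauss(0.0, noise_sd)
                  for _ in range(n_seeds)]
        honest.append(statistics.mean(scores))  # report the average over seeds
        picked.append(max(scores))              # report only the best seed
    return honest, picked


honest, picked = simulate_seed_cherry_picking()
diffs = [p - h for p, h in zip(picked, honest)]
inflation = statistics.mean(diffs)

print(f"mean inflation from picking the best seed: {inflation:.4f}")
print(f"paired effect size d_z: {cohens_dz(picked, honest):.2f}")
```

Because each synthetic dataset serves as its own control, the difference between the two reports isolates the effect of the selection step. In this toy setup the inflation is pure noise exploitation by construction, since the underlying score is identical in both conditions.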