Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP): From Dummy Masking to Key-LOO for Leakage-Free Feature Construction
This work addresses feature leakage in molecular machine learning and offers a practical solution for cheminformatics researchers, though the contribution is incremental because it builds on existing cross-validation methods.
The paper tackles feature leakage in molecular representations for predictive modeling by introducing molFTP, a compact representation that prevents leakage through dummy masking and approximates molecule-level leave-one-out cross-validation with key-LOO, with deviations below 8% on the authors' datasets.
We introduce molFTP (molecular fragment-target prevalence), a compact representation that delivers strong predictive performance. To prevent feature leakage across cross-validation folds, we implement a dummy-masking procedure that removes information about fragments present in the held-out molecules. We further show that key leave-one-out (key-LOO) closely approximates true molecule-level leave-one-out (LOO), with deviation below 8% on our datasets. This enables training on nearly the full dataset while preserving unbiased cross-validation estimates of model performance. Overall, molFTP provides a fast, leakage-resistant fragment-target prevalence vectorization with practical safeguards (dummy masking or key-LOO) that approximate LOO at a fraction of its cost.
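The abstract does not spell out either safeguard, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a fragment's prevalence is the fraction of active molecules among those carrying it, that masked or unseen keys fall back to a neutral dummy value, and that key-LOO subtracts each molecule's own contribution from the per-key counts. The toy data, the `NEUTRAL` value, the (mean, max) summary, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (set of fragment keys, binary activity label).
MOLS = [
    ({"a", "b"}, 1),
    ({"a", "c"}, 1),
    ({"b", "c"}, 0),
    ({"c", "d"}, 0),
]
NEUTRAL = 0.5  # assumed dummy value for keys with no remaining evidence


def count_keys(mols):
    """Count, per fragment key, how many molecules carry it and how many are active."""
    total, active = Counter(), Counter()
    for frags, label in mols:
        for key in frags:
            total[key] += 1
            active[key] += label
    return total, active


def prevalence(n_active, n_total):
    """Active fraction for one key; neutral dummy when no evidence is left."""
    return n_active / n_total if n_total > 0 else NEUTRAL


def summarize(values):
    """Collapse per-key prevalences into a tiny vector (mean, max).

    Assumes every molecule has at least one fragment key.
    """
    return (sum(values) / len(values), max(values))


def dummy_masked_features(mols, held_out):
    """Fold-wise features: counts come from training molecules only.

    Keys that occur only in held-out molecules receive the dummy value.
    """
    train = [m for i, m in enumerate(mols) if i not in held_out]
    total, active = count_keys(train)
    table = {k: prevalence(active[k], total[k]) for k in total}
    return [
        summarize([table.get(k, NEUTRAL) for k in sorted(frags)])
        for frags, _ in mols
    ]


def key_loo_features(mols):
    """Single-pass features: each molecule's own count is removed per key."""
    total, active = count_keys(mols)
    feats = []
    for frags, label in mols:
        vals = [prevalence(active[k] - label, total[k] - 1) for k in sorted(frags)]
        feats.append(summarize(vals))
    return feats


if __name__ == "__main__":
    print(key_loo_features(MOLS))
    print(dummy_masked_features(MOLS, held_out={3}))
```

Under these assumptions, the key-level correction reproduces the held-out molecule's row of a molecule-level LOO fold while needing only one pass over the counts, which is the cost saving the abstract points to. The reported sub-8% deviation suggests the actual procedure differs in details this toy does not model.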