On Improving Graph Neural Networks for QSAR by Pre-training on Extended-Connectivity Fingerprints
For practitioners in drug discovery using GNNs for QSAR, this work offers a pre-training strategy that can improve performance on certain tasks, though it is not universally effective.
The paper proposes pre-training Graph Neural Networks (GNNs) to predict Extended-Connectivity Fingerprints (ECFP) to improve QSAR performance. On five of six Biogen benchmarks, ECFP pre-trained GNNs showed a statistically significant improvement over all evaluated baselines, but in out-of-distribution settings they underperformed on more heterogeneous datasets and on more complex endpoints such as binding affinity prediction.
Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies, yet their superiority over classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training them to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. On five of six Biogen benchmarks, we observed a statistically significant improvement in standard performance metrics over all evaluated baselines when using ECFP pre-trained GNNs. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. Importantly, we investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.
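To make the pre-training target concrete, the following is a minimal sketch of one way a "predict the ECFP" objective could be scored. It assumes the fingerprint is a fixed-length vector of 0/1 bits and that the GNN emits one logit per bit, scored with mean binary cross-entropy. The abstract does not state the loss, the fingerprint length, or the architecture, so all of these are assumptions, and the function name `ecfp_pretraining_loss` is hypothetical. Computing the fingerprint itself (typically done with a cheminformatics toolkit) is not shown.

```python
import math


def ecfp_pretraining_loss(logits, target_bits):
    """Mean binary cross-entropy between per-bit logits and 0/1 fingerprint bits.

    ASSUMPTION: multi-label BCE over fingerprint bits is an illustrative choice,
    not necessarily the loss used in the paper.
    """
    if len(logits) != len(target_bits):
        raise ValueError("logits and target_bits must have the same length")
    total = 0.0
    for z, y in zip(logits, target_bits):
        # Numerically stable BCE-with-logits:
        #   max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    # Toy 4-bit "fingerprint"; real ECFPs are usually much longer.
    toy_target = [1, 0, 0, 1]
    toy_logits = [2.0, -1.5, 0.3, 0.8]  # hypothetical GNN outputs
    print(ecfp_pretraining_loss(toy_logits, toy_target))
```

Under this framing, a pre-trained encoder would then be fine-tuned on the downstream QSAR endpoint with a task-specific head. Whether the paper freezes or updates the encoder during fine-tuning is not stated in the abstract.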