LG · BM · Jul 13, 2022

Does GNN Pretraining Help Molecular Representation?

arXiv:2207.06010v2 · 26.692 citations · h-index: 8

Originality Synthesis-oriented

AI Analysis

This work informs AI-driven drug discovery by showing that the gains from self-supervised GNN pretraining may be incremental, or absent altogether, in many molecular settings.

The study investigates whether self-supervised pretraining improves molecular representations learned by graph neural networks. It finds that the benefit over non-pretrained baselines is often negligible or not statistically significant, and that the gains from additional supervised pretraining can diminish with richer input features or more balanced data splits.

Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find that the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have a statistically significant advantage over non-pretraining methods in many settings. Secondly, although a noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters could have a larger impact on the accuracy of downstream tasks than the choice of pretraining tasks, especially when the scales of the downstream tasks are small. Finally, we offer conjectures that the complexity of some pretraining methods on small molecules might be insufficient, followed by empirical evidence on different pretraining datasets.
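
To make the notion of a "statistically significant advantage" concrete, the following is a minimal sketch of one way to compare a pretrained model against a non-pretrained baseline across random seeds. It uses an exact two-sided permutation test on the difference of mean scores. The choice of test, the function name `permutation_test`, and the score lists are all assumptions made for illustration; the abstract does not say which test the authors use, and the numbers below are hypothetical, not results from the paper.

```python
from itertools import combinations
from statistics import mean


def permutation_test(scores_a, scores_b):
    """Exact two-sided permutation test on the difference of means.

    Returns the fraction of ways to reassign the pooled scores to the two
    groups whose absolute mean difference is at least the observed one.
    """
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    observed = abs(mean(scores_a) - mean(scores_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way to pick which pooled scores belong to group A.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(mean(group_a) - mean(group_b))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical ROC-AUC values, one per random seed. NOT results from the paper.
pretrained = [0.71, 0.69, 0.73, 0.70, 0.72]
from_scratch = [0.70, 0.68, 0.72, 0.71, 0.69]

p_value = permutation_test(pretrained, from_scratch)
gain = mean(pretrained) - mean(from_scratch)
print(f"mean gain = {gain:.3f}, p = {p_value:.3f}")
```

A small mean gain together with a large p-value is the kind of outcome the abstract describes: the pretrained model looks slightly better on average, but the difference is within seed-to-seed variation. The exact test enumerates all group assignments, so it is only practical for a handful of seeds per condition.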
