LG AI CHEM-PH BM Nov 5, 2024

Two-Stage Pretraining for Molecular Property Prediction in the Wild

Kevin Tirta Wijaya, Minghao Guo, Michael Sun, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

arXiv:2411.03537v2 2.6 h-index: 80

Originality: Incremental advance

AI Analysis

This paper addresses the cost and slowness of the laboratory experiments needed to obtain labels for molecular property prediction in real-world applications, and represents a strong domain-specific advance.

The paper tackles the problem of scarce labeled data for molecular property prediction by introducing MoleVers, a two-stage pretrained model that achieves state-of-the-art performance on 22 small, experimentally-validated datasets.

Molecular deep learning models have achieved remarkable success in property prediction, but they often require large amounts of labeled data. The challenge is that, in real-world applications, labels are extremely scarce, as obtaining them through laboratory experimentation is both expensive and time-consuming. In this work, we introduce MoleVers, a versatile pretrained molecular model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data through masked atom prediction and extreme denoising, a novel task enabled by our newly introduced branching encoder architecture and dynamic noise scale sampling. In the second stage, the model refines these representations by predicting auxiliary properties derived from computational methods, such as density functional theory or large language models. Evaluation on 22 small, experimentally-validated datasets demonstrates that MoleVers achieves state-of-the-art performance, highlighting the effectiveness of its two-stage framework in producing generalizable molecular representations for diverse downstream properties.
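As a rough illustration of the first-stage inputs the abstract describes, the sketch below corrupts one molecule for the two pretext tasks: it hides a few atom types (masked atom prediction) and perturbs the 3D coordinates with Gaussian noise whose scale is drawn afresh on every call (one simple reading of dynamic noise scale sampling). The function name `corrupt_molecule`, the `[MASK]` token, the 15% mask ratio, the 0 to 3 noise range, and the uniform sampling of the scale are all assumptions made for illustration and are not taken from the paper. The branching encoder and the second-stage auxiliary-property training are not modeled here.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a hidden atom type


def corrupt_molecule(atoms, coords, mask_ratio=0.15, sigma_range=(0.0, 3.0), rng=None):
    """Build inputs and targets for masked atom prediction and denoising.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = rng or random.Random()
    n = len(atoms)

    # Masked atom prediction: hide a random subset of atom types.
    n_mask = max(1, round(mask_ratio * n))
    masked_idx = sorted(rng.sample(range(n), n_mask))
    masked_atoms = list(atoms)
    for i in masked_idx:
        masked_atoms[i] = MASK_TOKEN

    # Dynamic noise scale (assumed uniform): one sigma per call, so the
    # model sees both small and very large perturbations over training.
    sigma = rng.uniform(*sigma_range)
    noise = [[rng.gauss(0.0, sigma) for _ in range(3)] for _ in range(n)]
    noisy_coords = [
        [c + e for c, e in zip(xyz, eps)] for xyz, eps in zip(coords, noise)
    ]

    return {
        "masked_atoms": masked_atoms,    # model input (atom types)
        "noisy_coords": noisy_coords,    # model input (coordinates)
        "masked_idx": masked_idx,
        "atom_targets": {i: atoms[i] for i in masked_idx},  # labels for masking task
        "noise_targets": noise,          # labels for denoising task
        "sigma": sigma,
    }


# Usage: a toy three-atom molecule with made-up coordinates.
toy_atoms = ["O", "H", "H"]
toy_coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
sample = corrupt_molecule(toy_atoms, toy_coords, rng=random.Random(0))
print(sample["masked_atoms"], round(sample["sigma"], 3))
```

A model pretrained this way would be asked to recover `atom_targets` from `masked_atoms` and `noise_targets` from `noisy_coords`; the inputs `atoms` and `coords` are left unmodified so the same molecule can be corrupted differently at each training step.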
