Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations
This work addresses the challenge of capturing molecular complexity for applications such as property prediction. It integrates molecular graph structures with knowledge graph data, which improves performance over existing baselines.
The paper tackles molecular representation learning by introducing GODE, a method that integrates molecular graph structures with knowledge graph data using contrastive learning. On property prediction tasks, it reports an average ROC-AUC increase of 12.7% for classification and an average RMSE/MAE reduction of 34.4% for regression.
Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecules. In this paper, we introduce a novel method called GODE, which accounts for the dual-level structure inherent in molecules: each molecule has an intrinsic graph structure and simultaneously functions as a node within a broader molecular knowledge graph. GODE integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing baselines, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, GODE surpasses the current leading model in property prediction, with gains of 2.2% in classification and 7.2% in regression tasks.
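The contrastive fusion step described above can be illustrated with a small sketch. The abstract does not state which contrastive objective is used, so the code below assumes a symmetric InfoNCE-style loss, a common choice for aligning two views of the same entity. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy embeddings are all hypothetical and stand in for the outputs of the two pre-trained GNNs; this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(mol_emb, kg_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: mol_emb[i] (molecule-graph view) and kg_emb[i]
    (knowledge-graph-subgraph view) describe the same molecule and form
    the positive pair; every other pairing in the batch is a negative.
    """
    if len(mol_emb) != len(kg_emb):
        raise ValueError("both views must contain the same number of molecules")

    mol = [l2_normalize(v) for v in mol_emb]
    kg = [l2_normalize(v) for v in kg_emb]
    n = len(mol)

    # Cosine similarity between every molecule view and every KG view,
    # divided by the temperature.
    logits = [
        [sum(a * b for a, b in zip(mol[i], kg[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            row_max = max(row)  # subtract the max for numerical stability
            log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    # Average the molecule-to-KG and KG-to-molecule directions.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


# Toy 2-D embeddings for two molecules (made-up values for illustration).
mol_view = [[1.0, 0.0], [0.0, 1.0]]
kg_view_aligned = [[1.0, 0.0], [0.0, 1.0]]
kg_view_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(mol_view, kg_view_aligned)
loss_swapped = info_nce(mol_view, kg_view_swapped)
```

In this toy setup the loss is lower when each molecule's two views agree than when they are swapped. Minimizing an objective of this kind therefore pulls a molecule's structural embedding toward the embedding of its own knowledge graph neighborhood and away from those of other molecules.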