LG, AI · Feb 25, 2025

Thinking like a CHEMIST: Combined Heterogeneous Embedding Model Integrating Structure and Tokens

Nikolai Rekut, Alexey Orlov, Klea Ziu, Elizaveta Starykh, Martin Takac, Aleksandr Beznosikov

arXiv:2502.17986v2 · 4.1 · h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses limitations in chemical compound representation and is relevant to researchers in computational chemistry and drug discovery. The advance is incremental, since it builds on existing language and graph models.

The study tackles the challenge of representing molecular structures by decomposing molecules into substructures, computing descriptor-based representations for those substructures, and feeding the result to a language model that is combined with graph-based models. The authors report notable improvements on tasks such as QSAR prediction.

Representing molecular structures effectively in chemistry remains a challenging task. Language models and graph-based models are extensively utilized within this domain, consistently achieving state-of-the-art results across an array of tasks. However, the prevailing practice of representing chemical compounds in the SMILES format, which is used by most datasets and many language models, presents notable limitations as a training data format. In this study, we present a novel approach that decomposes molecules into substructures and computes descriptor-based representations for these fragments, providing more detailed and chemically relevant input for model training. We use this substructure and descriptor data as input for a language model and also propose a bimodal architecture that integrates this language model with graph-based models. As the language model we use RoBERTa; as graph models we use Graph Isomorphism Networks (GIN), Graph Convolutional Networks (GCN), and Graphormer. Our framework shows notable improvements over traditional methods in various tasks such as Quantitative Structure-Activity Relationship (QSAR) prediction.
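The abstract does not say how fragments and their descriptors are serialized for the language model, so the following is only a minimal sketch of one possible encoding, not the authors' method. It assumes the fragments and descriptor values have already been computed elsewhere, and it discretizes each descriptor into a bin token placed next to the fragment's SMILES. The names `Fragment`, `bin_value`, and `molecule_to_sequence`, the `name_binK` token format, and the ` | ` separator are all hypothetical, and the descriptor values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One substructure plus its precomputed descriptors (hypothetical container)."""
    smiles: str
    descriptors: dict  # descriptor name -> float value


def bin_value(value, low, high, n_bins=10):
    """Clip value to [low, high] and map it to an integer bin in [0, n_bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    idx = int((clipped - low) * n_bins / (high - low))
    return min(idx, n_bins - 1)  # value == high would otherwise land in bin n_bins


def molecule_to_sequence(fragments, ranges, n_bins=10):
    """Serialize fragments as 'SMILES name_binK ...' chunks joined by ' | '."""
    chunks = []
    for frag in fragments:
        tokens = [frag.smiles]
        for name in sorted(frag.descriptors):  # sorted for a stable token order
            low, high = ranges[name]
            idx = bin_value(frag.descriptors[name], low, high, n_bins)
            tokens.append(f"{name}_bin{idx}")
        chunks.append(" ".join(tokens))
    return " | ".join(chunks)


# Toy inputs: descriptor values and ranges are placeholders, not measured data.
ranges = {"logp": (-5.0, 5.0), "mw": (0.0, 200.0)}
fragments = [
    Fragment("c1ccccc1", {"mw": 78.0, "logp": 2.0}),
    Fragment("C(=O)O", {"mw": 45.0, "logp": 0.0}),
]
print(molecule_to_sequence(fragments, ranges))
# prints: c1ccccc1 logp_bin7 mw_bin3 | C(=O)O logp_bin5 mw_bin2
```

A string like this could then be passed to a tokenizer whose vocabulary includes the bin tokens. Binning is only one design choice. Whether the paper bins descriptors, embeds them as continuous values, or does something else is not stated in the abstract.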
