LG · CL · CHEM-PH · Sep 18, 2021

Multilingual Molecular Representation Learning via Contrastive Pre-training

arXiv:2109.08830v3637 citations
Originality: Incremental advance
AI Analysis

This work addresses the limited diversity of molecular representations available to cheminformatics researchers. It offers a multilingual method that is an incremental advance over existing language-model approaches.

The paper tackled the limitation of single-language molecular representation learning by proposing MM-Deacon, a multilingual approach using SMILES and IUPAC, which achieved improved performance on seven MoleculeNet tasks, zero-shot retrieval, and drug-drug interaction prediction.

Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features for encoding molecules. However, these approaches use only a single molecular language for representation learning. Motivated by the fact that a given molecule can be described in different languages, such as the Simplified Molecular-Input Line-Entry System (SMILES), the International Union of Pure and Applied Chemistry (IUPAC) nomenclature, and the IUPAC International Chemical Identifier (InChI), we propose a multilingual molecular embedding generation approach called MM-Deacon (multilingual molecular domain embedding analysis via contrastive learning). MM-Deacon is pre-trained on a large-scale molecule corpus using SMILES and IUPAC as two different languages. We evaluated the robustness of our method on seven molecular property prediction tasks from the MoleculeNet benchmark, zero-shot cross-lingual retrieval, and a drug-drug interaction prediction task.
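To make the idea of contrastive pre-training across two molecular languages more concrete, here is a minimal sketch in pure Python. It assumes a symmetric InfoNCE-style objective over paired SMILES and IUPAC embeddings, which is one common choice for this kind of setup; the abstract does not state the exact loss. The function names, the temperature of 0.07, and the toy 2-dimensional embeddings are all hypothetical and stand in for the outputs of real encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(smiles_embs, iupac_embs):
    """Cosine similarity between every SMILES and every IUPAC embedding."""
    s = [l2_normalize(v) for v in smiles_embs]
    t = [l2_normalize(v) for v in iupac_embs]
    return [[sum(a * b for a, b in zip(u, w)) for w in t] for u in s]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(smiles_embs, iupac_embs, temperature=0.07):
    """Assumed objective: average of SMILES->IUPAC and IUPAC->SMILES losses.

    Row i of each input is taken to describe the same molecule, so matched
    pairs sit on the diagonal and all other pairs in the batch are negatives.
    """
    sim = similarity_matrix(smiles_embs, iupac_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def retrieve(query_emb, candidate_embs):
    """Zero-shot cross-lingual retrieval: index of the most similar candidate."""
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy embeddings (hypothetical): two molecules, one vector per language.
smiles_batch = [[1.0, 0.0], [0.0, 1.0]]
iupac_aligned = [[0.9, 0.1], [0.1, 0.9]]   # matched pairs point the same way
iupac_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

aligned_loss = symmetric_info_nce(smiles_batch, iupac_aligned)
shuffled_loss = symmetric_info_nce(smiles_batch, iupac_shuffled)
best_match = retrieve(smiles_batch[0], iupac_aligned)
```

In this toy setup the loss is lower when matched pairs are aligned than when they are shuffled, and retrieval for the first SMILES embedding returns the index of its paired IUPAC embedding. A real pipeline would replace the hand-written vectors with encoder outputs and compute the loss with a tensor library.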

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
