LGBM · Dec 7, 2024

SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision

arXiv:2412.05569v2 · 6 citations · h-index: 30 · ICLR
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in molecular representation learning for computational chemistry and drug discovery, offering an incremental but effective improvement over existing SMILES-based methods.

The paper tackles the problem that existing SMILES language models for molecular structures rely on single-token supervision and corrupted inputs, limiting their ability to capture substructural information. The proposed SMI-Editor model uses an edit-based approach with fragment-level supervision and valid SMILES inputs, achieving state-of-the-art performance on multiple downstream molecular tasks and outperforming some 3D models.

SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals, but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks and even outperforms several 3D molecular representation models.
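The drop-a-fragment, restore-by-editing idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tokenizer regex, the choice of "fragment" (a parenthesized branch with no ring-closure digits, used here as a crude stand-in for chemistry-aware fragmentation), and the helper names `tokenize`, `branch_spans`, `drop_random_branch`, `edit_targets`, and `apply_edits` are all assumptions made for illustration. The edit targets are derived with Python's standard `difflib`, whereas the paper's model learns to predict the edits.

```python
import difflib
import random
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, Br/Cl,
# two-digit ring closures (%nn), otherwise one character per token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles)


def branch_spans(tokens):
    """Find balanced '(...)' branches that contain no ring-closure tokens.

    Removing such a branch leaves a syntactically well-formed string,
    which is why it serves as the toy 'fragment' here.
    """
    spans, stack = [], []
    for i, tok in enumerate(tokens):
        if tok == "(":
            stack.append(i)
        elif tok == ")" and stack:
            start = stack.pop()
            inner = tokens[start + 1:i]
            if not any(t.isdigit() or t.startswith("%") for t in inner):
                spans.append((start, i + 1))
    return spans


def drop_random_branch(tokens, rng):
    """Return a copy of tokens with one randomly chosen branch removed."""
    spans = branch_spans(tokens)
    if not spans:
        return list(tokens)
    start, end = rng.choice(spans)
    return tokens[:start] + tokens[end:]


def edit_targets(corrupted, original):
    """Edit operations (tag, i1, i2, new_tokens) turning corrupted into original."""
    matcher = difflib.SequenceMatcher(a=corrupted, b=original, autojunk=False)
    return [
        (tag, i1, i2, original[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_edits(corrupted, edits):
    """Apply edits right to left so earlier indices stay valid."""
    out = list(corrupted)
    for _tag, i1, i2, new_tokens in reversed(edits):
        out[i1:i2] = new_tokens
    return out


# Example molecule: aspirin.
original = tokenize("CC(=O)Oc1ccccc1C(=O)O")
corrupted = drop_random_branch(original, random.Random(0))
edits = edit_targets(corrupted, original)
restored = apply_edits(corrupted, edits)
print("".join(corrupted), edits, "".join(restored))
```

In this sketch the supervision signal is the whole missing branch (several tokens inserted at one position) rather than a single masked token, and the input remains a well-formed SMILES string. A faithful reproduction would need a chemistry toolkit to fragment molecules and verify validity.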
