Chemical Language Models for Natural Products: A State-Space Model Approach
This work models Natural Products for drug discovery, offering incremental improvements in molecule generation and property prediction for this domain.
The paper tackles the underexplored area of Natural Products (NPs) in drug discovery by developing NP-specific chemical language models based on state-space models (Mamba and Mamba-2) and comparing them with a transformer baseline (GPT). It finds that Mamba generates 1-2% more valid and unique molecules than Mamba-2 and GPT, and that Mamba variants outperform GPT by 0.02-0.04 MCC in property prediction under random splits.
Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet Natural Products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation (validity, uniqueness, novelty) and property prediction (membrane permeability, taste, anti-cancer activity) using MCC and AUC-ROC. Mamba generates 1-2 percent more valid and unique molecules than Mamba-2 and GPT, with fewer long-range dependency errors, while GPT yields slightly more novel structures. For property prediction, Mamba variants outperform GPT by 0.02-0.04 MCC under random splits, while scaffold splits show comparable performance. Results demonstrate that domain-specific pre-training on about 1M NPs can match models trained on datasets over 100 times larger.
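The evaluation metrics named above can be illustrated with a minimal sketch. It assumes the common conventions that validity is the fraction of generated strings that parse, uniqueness is the fraction of valid molecules that are distinct, and novelty is the fraction of distinct valid molecules absent from the training set; the paper's exact definitions may differ. The function names and the `toy_canonicalize` check (balanced parentheses only) are hypothetical stand-ins; in practice a cheminformatics toolkit such as RDKit would parse and canonicalize SMILES.

```python
from math import sqrt
from typing import Callable, Dict, Iterable, Optional, Sequence, Set


def generation_metrics(
    generated: Iterable[str],
    training_set: Set[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Validity, uniqueness, and novelty under the assumed definitions."""
    generated = list(generated)
    # canonicalize returns None for strings that do not parse
    valid = [c for c in map(canonicalize, generated) if c is not None]
    unique = set(valid)
    novel = unique - training_set
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Hypothetical validity check: non-empty with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None


if __name__ == "__main__":
    samples = ["CCO", "CCO", "CC(C)O", "CC(C", "c1ccccc1"]
    print(generation_metrics(samples, {"CCO"}, toy_canonicalize))
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because MCC uses all four cells of the confusion matrix, it stays informative on imbalanced label sets, which makes a gap of 0.02-0.04 MCC easier to interpret than the same gap in accuracy.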