AI, CL, LG, QM | May 18, 2025

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

arXiv:2505.12565v2 | 3 citations | h-index: 84
Originality: Highly original
AI Analysis

This work addresses the challenge of generating functional, makeable molecules for drug discovery. It is an incremental advance that introduces a novel tokenization approach, at the level of functional building blocks, to improve synthesizability and property prediction.

The paper tackles the problem of generating novel molecules that have desired functions and can actually be synthesized. It proposes mCLM, a modular chemical language model that tokenizes molecules at the level of functional building blocks. The authors report improvements in synthetic accessibility relative to 7 other generative methods, better property scores and synthetic accessibility than all baselines on 122 out-of-distribution medicines, and the ability to iteratively improve drug candidates that failed late in clinical trials.

Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. mCLM, with only 3B parameters, achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").
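To make the tokenization argument concrete, here is a minimal sketch contrasting character-level tokenization of a SMILES string with block-level tokenization. Everything in it is an assumption made for illustration: the `BLOCK_VOCAB` entries, the token names, and the greedy longest-match over SMILES substrings are hypothetical and are not taken from the paper. mCLM's actual building blocks and tokenizer are defined by the authors and are tied to automated modular synthesis, which this toy string matching does not model.

```python
# Toy illustration only. The vocabulary and the greedy substring matching
# are assumptions for this sketch, not mCLM's actual tokenizer.

# Hypothetical block vocabulary: SMILES fragments mapped to single tokens.
BLOCK_VOCAB = {
    "c1ccccc1": "<BLOCK:phenyl>",
    "C(=O)O": "<BLOCK:carboxyl>",
    "C(=O)N": "<BLOCK:amide>",
}


def char_tokenize(smiles: str) -> list[str]:
    """Character-level baseline: one token per SMILES character."""
    return list(smiles)


def block_tokenize(smiles: str) -> list[str]:
    """Greedy longest-match against BLOCK_VOCAB, falling back to characters."""
    fragments = sorted(BLOCK_VOCAB, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(smiles):
        for frag in fragments:
            if smiles.startswith(frag, i):
                tokens.append(BLOCK_VOCAB[frag])
                i += len(frag)
                break
        else:
            # No known block starts here: emit the single character.
            tokens.append(smiles[i])
            i += 1
    return tokens


benzoic_acid = "c1ccccc1C(=O)O"
print(len(char_tokenize(benzoic_acid)), char_tokenize(benzoic_acid))
print(len(block_tokenize(benzoic_acid)), block_tokenize(benzoic_acid))
```

In this toy case the same molecule shrinks from 14 character tokens to 2 block tokens, each of which names a chemically meaningful unit. That mirrors the abstract's analogy to (sub-)word tokens versus characters in text, under the stated assumptions.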

