LGAIBM Sep 23, 2025

A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Discovery

arXiv:2509.19586v1 | 1 citation | h-index: 22
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for comprehensive chemical space coverage in drug discovery, but its contribution is incremental: it applies an existing GPT-2 approach to a new dataset.

The paper tackles the problem of fragment-based drug discovery by introducing FragAtlas-62M, a foundation model trained on 62 million molecules, which generates 99.90% chemically valid fragments and produces 22% novel structures with practical relevance.

We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemically valid fragments. Validation across 12 descriptors and three fingerprint methods shows generated fragments closely match the training distribution (all effect sizes < 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.
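The abstract reports three kinds of numbers: distribution-match effect sizes, a fraction of known fragments retained, and a fraction of novel structures. The sketch below shows one plausible way such quantities could be computed. It is not the authors' evaluation code. It assumes the effect size is Cohen's d with a pooled standard deviation, that molecules are compared as already-canonicalized SMILES strings, that "novel" means generated but absent from the training set, and that "retained" means training fragments the model reproduces. The function names `cohens_d` and `novelty_and_recovery` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using a pooled sample variance.

    Assumption: the paper's "effect size" is taken to be Cohen's d here;
    the abstract does not name the statistic.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def novelty_and_recovery(generated, training):
    """Return (novel_fraction, recovered_fraction) for SMILES collections.

    Assumption: inputs are canonical SMILES, so string equality stands in
    for molecular identity. Duplicates in `generated` are counted once.
    """
    gen_unique = set(generated)
    train = set(training)
    # Share of unique generated molecules that never appear in training.
    novel_fraction = len(gen_unique - train) / len(gen_unique)
    # Share of training molecules that the model reproduced.
    recovered_fraction = len(gen_unique & train) / len(train)
    return novel_fraction, recovered_fraction


if __name__ == "__main__":
    # Toy descriptor values (e.g. molecular weight) for two small samples.
    print(cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
    # Toy SMILES lists; real use would canonicalize with a cheminformatics toolkit.
    print(novelty_and_recovery(["CCO", "CCN", "CCC", "CCO"],
                               ["CCO", "CCN", "CCCl", "c1ccccc1"]))
```

In practice the descriptor values and canonical SMILES would come from a cheminformatics toolkit, and the denominators chosen here (unique generated molecules for novelty, training set size for recovery) are one convention among several, so the paper's percentages may be defined differently.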
