LG AI CL

Mar 5, 2025

Transformers for molecular property prediction: Domain adaptation efficiently improves performance

Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Dietrich Klakow, Andrea Volkamer

arXiv:2503.03360v311.45 citations

h-index: 2

Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency and performance challenges in drug discovery by showing that targeted domain adaptation can enhance transformer models for molecular property prediction, though the advance is incremental.

The study found that increasing pre-training dataset size beyond 400K-800K molecules does not improve molecular property prediction across seven ADME datasets, but domain adaptation with a small domain-specific dataset (≤4K molecules) significantly boosts performance (P<0.001), making it comparable to larger models like MolBERT.

Over the past six years, molecular transformer models have become key tools in drug discovery. Most existing models are pre-trained on large, unlabeled datasets such as ZINC or ChEMBL. However, the extent to which large-scale pre-training improves molecular property prediction remains unclear. This study evaluates transformer models for this task while addressing their limitations. We explore how pre-training dataset size and chemically informed objectives impact performance. Our results show that increasing the dataset beyond approximately 400K to 800K molecules from large-scale unlabeled databases does not enhance performance across seven datasets covering five ADME endpoints: lipophilicity, permeability, solubility (two datasets), microsomal stability (two datasets), and plasma protein binding. In contrast, domain adaptation on a small, domain-specific dataset (≤4K molecules) using multi-task regression of physicochemical properties significantly boosts performance (P < 0.001). A model pre-trained on 400K molecules and adapted with domain-specific data outperforms larger models such as MolFormer and performs comparably to MolBERT. Benchmarks against Random Forest (RF) baselines using descriptors and Morgan fingerprints show that chemically and physically informed features consistently yield better performance across model types. While RF remains a strong baseline, we identify concrete practices to enhance transformer performance. Aligning pre-training and adaptation with chemically meaningful tasks and domain-relevant data presents a promising direction for molecular property prediction. Our models are available on HuggingFace for easy use and adaptation.
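As a rough illustration of what "multi-task regression of physicochemical properties" can look like as a training objective, the sketch below standardizes each property and averages the per-task mean squared errors. The abstract does not specify the loss, the scaling, or the property set, so the function names, the equal task weighting, and the example values are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def standardize_columns(targets):
    """Z-score each property (column) so that no single task dominates the loss."""
    columns = list(zip(*targets))
    # Fall back to a scale of 1.0 if a property has zero variance.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    scaled = [
        [(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
        for row in targets
    ]
    return scaled, stats


def multitask_mse(predictions, targets):
    """Return the equally weighted mean of per-task MSEs, plus the per-task values."""
    n_tasks = len(targets[0])
    per_task = []
    for t in range(n_tasks):
        errors = [(p[t] - y[t]) ** 2 for p, y in zip(predictions, targets)]
        per_task.append(mean(errors))
    return mean(per_task), per_task


# Hypothetical values for three molecules and two illustrative properties
# (for example molecular weight and logP); these are not data from the paper.
raw_targets = [[180.2, 1.2], [250.3, 2.8], [320.4, 4.4]]
scaled_targets, stats = standardize_columns(raw_targets)

# A model that always predicts the mean (zero after scaling) serves as a reference point.
zero_predictions = [[0.0, 0.0] for _ in scaled_targets]
loss, per_task_loss = multitask_mse(zero_predictions, scaled_targets)
```

In an actual adaptation run the predictions would come from a regression head on top of the pre-trained transformer; the constant predictor here only provides a reference value for the loss on standardized targets.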
