LG | CHEM-PH | Jun 18, 2025

Descriptor-based Foundation Models for Molecular Property Prediction

MIT
arXiv:2506.15792v1 | 7 citations | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses accurate and scalable molecular property prediction for scientific domains such as drug discovery. The advance is incremental: it builds on existing foundation-model and descriptor-based approaches.

The study introduces CheMeleon, a foundation model pre-trained on deterministic molecular descriptors. It achieved a 79% win rate on Polaris tasks and a 97% win rate on MoleculeACE assays, outperforming baselines such as Random Forest and Chemprop.

Fast and accurate prediction of molecular properties with machine learning is pivotal to scientific advancements across myriad domains. Foundation models have proven especially effective, enabling accurate training on small, real-world datasets. This study introduces CheMeleon, a novel molecular foundation model pre-trained on deterministic molecular descriptors from the Mordred package, leveraging a Directed Message-Passing Neural Network to predict these descriptors in a noise-free setting. Unlike conventional approaches relying on noisy experimental data or biased quantum mechanical simulations, CheMeleon uses low-noise molecular descriptors to learn rich molecular representations. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 79% on Polaris tasks, outperforming baselines like Random Forest (46%), fastprop (39%), and Chemprop (36%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (63%) and other foundation models. However, like many of the tested models, it struggles to distinguish activity cliffs. The t-SNE projection of CheMeleon's learned representations demonstrates effective separation of chemical series, highlighting its ability to capture structural nuances. These results underscore the potential of descriptor-based pre-training for scalable and effective molecular property prediction, opening avenues for further exploration of descriptor sets and unlabeled datasets.
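The reported win rates sum to more than 100% across models (for example 79%, 46%, 39%, and 36% on Polaris), which suggests that several models can be credited with a win on the same task. The sketch below shows one simple way such a metric could be computed. It assumes a model wins a task when its score is within a tolerance of the best score on that task; the paper's exact criterion may differ (for instance, it may rely on a statistical test). The function name `win_rates`, the `tol` parameter, and the toy scores for models "A", "B", and "C" are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def win_rates(scores, higher_is_better=True, tol=0.0):
    """Return {model: fraction of tasks won}.

    scores: {task_name: {model_name: score}}
    A model "wins" a task if its score is within `tol` of the best score
    on that task, so ties give every tied model a win (an assumption).
    """
    wins = defaultdict(int)
    models = set()
    for by_model in scores.values():
        values = by_model.values()
        best = max(values) if higher_is_better else min(values)
        for model, score in by_model.items():
            models.add(model)
            if abs(score - best) <= tol:
                wins[model] += 1
    n_tasks = len(scores)
    return {model: wins[model] / n_tasks for model in models}


# Toy, made-up scores (higher is better); not results from the paper.
toy_scores = {
    "task_1": {"A": 0.82, "B": 0.82, "C": 0.75},  # A and B tie
    "task_2": {"A": 0.90, "B": 0.70, "C": 0.88},  # A wins
    "task_3": {"A": 0.60, "B": 0.65, "C": 0.65},  # B and C tie
    "task_4": {"A": 0.77, "B": 0.70, "C": 0.71},  # A wins
}

rates = win_rates(toy_scores)
for model in sorted(rates):
    print(f"{model}: {rates[model]:.0%}")
```

With these toy scores the rates are 75%, 50%, and 25%, which add up to 150% because tied tasks are counted for every tied model.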

Code Implementations: 1 repo
