Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
This work evaluates pretrained models for molecular representation learning in chemistry and drug design, and reveals potential issues with evaluation rigor in existing studies. The contribution is incremental in that it builds on prior comparisons.
The study conducted the most extensive comparison of pretrained molecular embedding models to date, evaluating 25 models across 25 datasets, and found that nearly all neural models showed negligible or no improvement over the baseline ECFP molecular fingerprint, with only the CLAMP model performing statistically significantly better.
Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
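To make the idea of comparing a model against a baseline across datasets more concrete, the following is a minimal sketch of a paired bootstrap with a region of practical equivalence. It is not the hierarchical Bayesian model used in the study; it is a much simpler stand-in that only illustrates how per-dataset score differences can be summarized as "worse", "equivalent", or "better". The function name `paired_bootstrap`, the `rope` threshold, and all scores are hypothetical and invented for illustration; they are not results from the study.

```python
import random
from statistics import mean


def paired_bootstrap(model_scores, baseline_scores, rope=0.01,
                     n_resamples=5000, seed=0):
    """Estimate how often the mean paired difference falls below, inside,
    or above a region of practical equivalence [-rope, +rope].

    Assumption: scores are paired per dataset and higher is better.
    This is a simplified stand-in, not a hierarchical Bayesian test.
    """
    if len(model_scores) != len(baseline_scores):
        raise ValueError("score lists must have one entry per dataset")

    # Per-dataset difference: positive means the model beats the baseline.
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]

    rng = random.Random(seed)
    counts = {"worse": 0, "equivalent": 0, "better": 0}
    for _ in range(n_resamples):
        # Resample datasets with replacement and average the differences.
        resampled_mean = mean(rng.choices(diffs, k=len(diffs)))
        if resampled_mean > rope:
            counts["better"] += 1
        elif resampled_mean < -rope:
            counts["worse"] += 1
        else:
            counts["equivalent"] += 1

    return {outcome: n / n_resamples for outcome, n in counts.items()}


# Made-up scores for five hypothetical datasets (illustration only).
baseline = [0.70, 0.65, 0.80, 0.75, 0.60]
model_a = [0.702, 0.647, 0.801, 0.749, 0.600]  # differences near zero
model_b = [0.75, 0.70, 0.85, 0.80, 0.65]       # consistently higher

result_a = paired_bootstrap(model_a, baseline)
result_b = paired_bootstrap(model_b, baseline)
print("model_a vs baseline:", result_a)
print("model_b vs baseline:", result_b)
```

With these invented numbers, every difference for `model_a` lies inside the equivalence region, so it is classed as practically equivalent to the baseline, while `model_b` is classed as better. A hierarchical Bayesian test, as described in the abstract, would additionally model variation within and across datasets rather than treating each dataset as a single point.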