Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance
This work offers a computationally efficient alternative to traditional molecular similarity methods for drug discovery researchers, though it is an incremental application of existing pretrained models.
The authors propose pretrained embedding distance (PED) as a scalable similarity measure for ligand-based drug discovery, showing it correlates with traditional metrics and performs effectively in virtual screening and molecular generation without task-specific training.
Molecular similarity plays a central role in ligand-based drug discovery tasks such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics and performs effectively both in ranking molecules for virtual screening and in guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measure for modern AI-aided drug discovery.
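To make the idea of an embedding distance concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which distance function or which pretrained model PED uses, so Euclidean distance is an assumption here, and the three-dimensional vectors are hand-written toy stand-ins for what a pretrained molecular encoder would output. The names `embedding_distance`, `rank_library`, and `similarity_reward` are hypothetical, and the reward shaping `1 / (1 + d)` is one illustrative way to turn a distance into a generation reward.

```python
import math
from typing import Dict, List, Sequence, Tuple


def embedding_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings (assumed metric, not from the paper)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_library(
    query: Sequence[float], library: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank library molecules by increasing distance to the query embedding."""
    scored = [(name, embedding_distance(query, emb)) for name, emb in library.items()]
    return sorted(scored, key=lambda item: item[1])


def similarity_reward(query: Sequence[float], candidate: Sequence[float]) -> float:
    """Hypothetical reward in (0, 1]: larger when the candidate is closer to the query."""
    return 1.0 / (1.0 + embedding_distance(query, candidate))


# Toy embeddings; in practice these would come from a pretrained molecular model.
query_embedding = [0.0, 1.0, 0.0]
library_embeddings = {
    "mol_a": [0.0, 1.0, 0.0],
    "mol_b": [1.0, 1.0, 0.0],
    "mol_c": [3.0, 5.0, 0.0],
}

ranking = rank_library(query_embedding, library_embeddings)
for name, distance in ranking:
    print(f"{name}: distance={distance:.3f}")
```

Because no similarity-specific training is involved, the same two functions cover both uses named in the abstract: `rank_library` orders a screening library against a known active, and `similarity_reward` could be plugged into a generator's objective.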