The Geometry of Activity Cliffs: Representation Dependence and Multi-Scale Characterization of Activity Landscapes

arXiv:2605.30831
AI Analysis

This research is significant for computational chemists and drug discovery researchers, as it challenges the conventional understanding of activity cliffs and highlights the critical role of molecular representation in their characterization, potentially leading to more informed choices in virtual screening and lead optimization.

This paper investigates activity cliffs, pairs of structurally similar compounds with large potency differences, and argues that how they are identified and understood depends heavily on the chosen molecular representation rather than being an intrinsic property of the compounds. The authors developed a six-step pipeline and applied it to fifteen embedding-metric configurations across three datasets, finding that no single representation excels on all criteria.

Activity cliffs, pairs of structurally similar compounds with large potency differences, are widely treated as intrinsic features of chemical datasets. We argue that, apart from target biology, much of our understanding of cliffs is a consequence of the geometry induced by the chosen molecular representation, not a property of the molecule pair itself. We designed a six-step pipeline to test this hypothesis systematically. The pipeline consists of: assessing pairwise distance geometry, cliff enrichment, activity gradient distribution, persistent homology of the cliff subspace, predictive benchmarking for a chosen embedding-metric pair, and finally, analysis of matched molecular pairs and stereoisomers. We applied the pipeline to fifteen embedding-metric configurations to build a benchmark across three distinct datasets known for their activity-cliff challenges. No representation excels on all criteria: Morgan Tanimoto provides the strongest cliff enrichment and cross-scaffold generalization; MolFormer cosine provides the only meaningful stereochemical sensitivity; MACCS and RDKit Dice fingerprints are most sensitive to matched-molecular-pair transformations; ChemBERTa fails uniformly due to embedding collapse. These findings are not a ranking. They reflect the fact that different representations encode different aspects of molecular recognition, and that choosing a representation implicitly defines what an activity cliff is.
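
The claim that the choice of representation defines what counts as a cliff can be illustrated with a minimal sketch. This is not the authors' pipeline. The fingerprints below are hypothetical sets of on-bits, the activities are invented pIC50-like values, and the thresholds (`sim_min`, `delta_min`) are illustrative assumptions. The activity gradient is written in a SALI-style form (activity difference divided by distance), which is also an assumption, since the abstract does not state the exact definition used.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a, b):
    """Dice similarity between two sets of on-bits."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def find_cliffs(fps, activity, sim, sim_min, delta_min):
    """Return pairs that are similar (>= sim_min) yet differ in activity (>= delta_min)."""
    cliffs = []
    for i, j in combinations(sorted(fps), 2):
        similar = sim(fps[i], fps[j]) >= sim_min
        big_gap = abs(activity[i] - activity[j]) >= delta_min
        if similar and big_gap:
            cliffs.append((i, j))
    return cliffs


def activity_gradient(fps, activity, sim, i, j):
    """SALI-style gradient: |activity difference| / (1 - similarity)."""
    distance = 1.0 - sim(fps[i], fps[j])
    delta = abs(activity[i] - activity[j])
    return float("inf") if distance == 0 else delta / distance


# Hypothetical toy data: three compounds, fingerprints as sets of on-bits.
fps = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2, 3, 4, 7, 8},
    "C": {1, 2, 3, 4, 5, 9},
}
activity = {"A": 8.5, "B": 6.0, "C": 8.3}  # invented pIC50-like values

# Same molecules, same thresholds, different metric.
cliffs_tanimoto = find_cliffs(fps, activity, tanimoto, sim_min=0.6, delta_min=2.0)
cliffs_dice = find_cliffs(fps, activity, dice, sim_min=0.6, delta_min=2.0)

print("Tanimoto cliffs:", cliffs_tanimoto)
print("Dice cliffs:", cliffs_dice)
print("Gradient A-B (Tanimoto):", activity_gradient(fps, activity, tanimoto, "A", "B"))
print("Gradient A-B (Dice):", activity_gradient(fps, activity, dice, "A", "B"))
```

With these toy values, the Tanimoto metric flags no cliffs at the chosen threshold, while Dice flags the pairs A-B and B-C. Dice similarity is never lower than Tanimoto for the same pair of bit sets, so a fixed threshold admits more pairs under Dice. The molecules and activities are unchanged; only the geometry differs.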
