LGDec 4, 2025Code
LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative ModelsSiddharth Betala, Samuel P. Gleason, Ali Ramlaoui et al.
Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.
QMNov 4, 2022
De novo PROTAC design using graph-based deep generative modelsDivya Nori, Connor W. Coley, Rocío Mercado
PROteolysis TArgeting Chimeras (PROTACs) are an emerging therapeutic modality for degrading a protein of interest (POI) by marking it for degradation by the proteasome. Recent developments in artificial intelligence (AI) suggest that deep generative models can assist with the de novo design of molecules with desired properties, and their application to PROTAC design remains largely unexplored. We show that a graph-based generative model can be used to propose novel PROTAC-like structures from empty graphs. Our model can be guided towards the generation of large molecules (30--140 heavy atoms) predicted to degrade a POI through policy-gradient reinforcement learning (RL). Rewards during RL are applied using a boosted tree surrogate model that predicts a molecule's degradation potential for each POI. Using this approach, we steer the generative model towards compounds with higher likelihoods of predicted degradation activity. Despite being trained on sparse public data, the generative model proposes molecules with substructures found in known degraders. After fine-tuning, predicted activity against a challenging POI increases from 50% to >80% with near-perfect chemical validity for sampled compounds, suggesting this is a promising approach for the optimization of large, PROTAC-like molecules for targeted protein degradation.
48.5QMMay 11
Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction WorkflowsYaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan et al.
Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
42.1LGMay 11
QT-Net: Rethinking Evaluation of AI Models in Atomic Chemical SpacePablo Martínez Crespo, Stefano Ribes, Martin Rahm et al.
Atomic properties such as partial charges or multipoles encode chemically meaningful information that can inform downstream molecular property prediction, but their evaluation as machine learning targets has been complicated by the absence of a principled out-of-distribution evaluation protocol at the atomic level. In this work, we propose a held-out evaluation protocol that clusters atomic environments by SOAP descriptors and computes metrics accounting only for cluster labels unseen during training. Following this procedure, we use 5$\times$5 cross-validation and Tukey's HSD to run a statistically rigorous comparison of E(3)-equivariant against non-equivariant, rotationally augmented models for predicting electron populations and multipoles of H, C, N, and O atoms. Building on our results, we introduce the Quantum Topological Neural Network (QT-Net), a rotationally augmented, non-equivariant graph neural network. We show that QT-Net can be used to infer properties of atoms in molecules from QM9 outside our training set, and that these inferred properties can yield improvement when used as input features for downstream molecular property prediction. To further validate the framework, molecular dipole moments computed from QT-Net's per-atom outputs recover the ground-truth values reported in QM9. We release all code and data, including a JAX implementation of QT-Net, to support the broader use of learned QTA properties as inductive biases for atomic-scale molecular machine learning.
LGFeb 4, 2025
deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language ModelsFrederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bjørnager Dam et al.
Novel materials drive progress across applications from energy storage to electronics. Automated characterization of material structures with machine learning methods offers a promising strategy for accelerating this key step in material design. In this work, we introduce an autoregressive language model that performs crystal structure prediction (CSP) from powder diffraction data. The presented model, deCIFer, generates crystal structures in the widely used Crystallographic Information File (CIF) format and can be conditioned on powder X-ray diffraction (PXRD) data. Unlike earlier works that primarily rely on high-level descriptors like composition, deCIFer is also able to use diffraction data to perform CSP. We train deCIFer on nearly 2.3M crystal structures and validate on diverse sets of PXRD patterns for characterizing challenging inorganic crystal systems. Qualitative checks and quantitative assessments using the residual weighted profile show that deCIFer produces structures that more accurately match the target diffraction data. Notably, deCIFer can achieve a 94% match rate on test data. deCIFer bridges experimental diffraction data with computational CSP, lending itself as a powerful tool for crystal structure characterization.
QMOct 22, 2025
Compressing Biology: Evaluating the Stable Diffusion VAE for Phenotypic Drug DiscoveryTélio Cropsal, Rocío Mercado
High-throughput phenotypic screens generate vast microscopy image datasets that push the limits of generative models due to their large dimensionality. Despite the growing popularity of general-purpose models trained on natural images for microscopy data analysis, their suitability in this domain has not been quantitatively demonstrated. We present the first systematic evaluation of Stable Diffusion's variational autoencoder (SD-VAE) for reconstructing Cell Painting images, assessing performance across a large dataset with diverse molecular perturbations and cell types. We find that SD-VAE reconstructions preserve phenotypic signals with minimal loss, supporting its use in microscopy workflows. To benchmark reconstruction quality, we compare pixel-level, embedding-based, latent-space, and retrieval-based metrics for a biologically informed evaluation. We show that general-purpose feature extractors like InceptionV3 match or surpass publicly available bespoke models in retrieval tasks, simplifying future pipelines. Our findings offer practical guidelines for evaluating generative models on microscopy data and support the use of off-the-shelf models in phenotypic drug discovery.
BMJun 24, 2024
A Comprehensive Review of Emerging Approaches in Machine Learning for De Novo PROTAC DesignYossra Gharbi, Rocío Mercado
Targeted protein degradation (TPD) is a rapidly growing field in modern drug discovery that aims to regulate the intracellular levels of proteins by harnessing the cell's innate degradation pathways to selectively target and degrade disease-related proteins. This strategy creates new opportunities for therapeutic intervention in cases where occupancy-based inhibitors have not been successful. Proteolysis-targeting chimeras (PROTACs) are at the heart of TPD strategies, which leverage the ubiquitin-proteasome system for the selective targeting and proteasomal degradation of pathogenic proteins. As the field evolves, it becomes increasingly apparent that the traditional methodologies for designing such complex molecules have limitations. This has led to the use of machine learning (ML) and generative modeling to improve and accelerate the development process. In this review, we explore the impact of ML on de novo PROTAC design $-$ an aspect of molecular design that has not been comprehensively reviewed despite its significance. We delve into the distinct characteristics of PROTAC linker design, underscoring the complexities required to create effective bifunctional molecules capable of TPD. We then examine how ML in the context of fragment-based drug design (FBDD), honed in the realm of small-molecule drug discovery, is paving the way for PROTAC linker design. Our review provides a critical evaluation of the limitations inherent in applying this method to the complex field of PROTAC development. Moreover, we review existing ML works applied to PROTAC design, highlighting pioneering efforts and, importantly, the limitations these studies face. By offering insights into the current state of PROTAC development and the integral role of ML in PROTAC design, we aim to provide valuable perspectives for researchers in their pursuit of better design strategies for this new modality.
LGOct 12, 2021
Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular DesignWenhao Gao, Rocío Mercado, Connor W. Coley
Molecular design and synthesis planning are two critical steps in the process of molecular discovery that we propose to formulate as a single shared task of conditional synthetic pathway generation. We report an amortized approach to generate synthetic pathways as a Markov decision process conditioned on a target molecular embedding. This approach allows us to conduct synthesis planning in a bottom-up manner and design synthesizable molecules by decoding from optimized conditional codes, demonstrating the potential to solve both problems of design and synthesis simultaneously. The approach leverages neural networks to probabilistically model the synthetic trees, one reaction step at a time, according to reactivity rules encoded in a discrete action space of reaction templates. We train these networks on hundreds of thousands of artificial pathways generated from a pool of purchasable compounds and a list of expert-curated templates. We validate our method with (a) the recovery of molecules using conditional generation, (b) the identification of synthesizable structural analogs, and (c) the optimization of molecular structures given oracle functions relevant to drug discovery.