Marwin Segler

LG
h-index69
16papers
537citations
Novelty57%
AI Score43

16 Papers

AIJan 31, 2023Code
Retrosynthetic Planning with Dual Value Networks

Guoqing Liu, Di Xue, Shufang Xie et al.

Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes. Specifically, we propose a novel online training algorithm, called Planning with Dual Value Networks (PDVN), which alternates between the planning phase and updating phase. In PDVN, we construct two separate value networks to predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% molecules for RetroGraph). Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph). Our code is available at \url{https://github.com/DiXue98/PDVN}.

QMAug 30, 2023
RetroBridge: Modeling Retrosynthesis with Markov Bridges

Ilia Igashov, Arne Schneuing, Marwin Segler et al.

Retrosynthesis planning is a fundamental challenge in chemistry which aims at designing reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule and confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis planning as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed to approximate the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks.

LGOct 30, 2023
Re-evaluating Retrosynthesis Algorithms with Syntheseus

Krzysztof Maziarz, Austin Tripp, Guoqing Liu et al.

Automated Synthesis Planning has recently re-emerged as a research area at the intersection of chemistry and machine learning. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques, and unnecessarily hamper progress. To remedy this, we present a synthesis planning library with an extensive benchmarking framework, called syntheseus, which promotes best practice by default, enabling consistent meaningful evaluation of single-step models and multi-step planning algorithms. We demonstrate the capabilities of syntheseus by re-evaluating several previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes in controlled evaluation experiments. We end with guidance for future works in this area, and call the community to engage in the discussion on how to improve benchmarks for synthesis planning.

AIOct 13, 2023
Retro-fallback: retrosynthetic planning in an uncertain world

Austin Tripp, Krzysztof Maziarz, Sarah Lewis et al.

Retrosynthesis is the task of planning a series of chemical reactions to create a desired molecule from simpler, buyable molecules. While previous works have proposed algorithms to find optimal solutions for a range of metrics (e.g. shortest, lowest-cost), these works generally overlook the fact that we have imperfect knowledge of the space of possible reactions, meaning plans created by algorithms may not work in a laboratory. In this paper we propose a novel formulation of retrosynthesis in terms of stochastic processes to account for this uncertainty. We then propose a novel greedy algorithm called retro-fallback which maximizes the probability that at least one synthesis plan can be executed in the lab. Using in-silico benchmarks we demonstrate that retro-fallback generally produces better sets of synthesis plans than the popular MCTS and retro* algorithms.

LGDec 15, 2025
A Scientific Reasoning Model for Organic Synthesis Procedure Generation

Guoqing Liu, Junren Li, Zihan Zhao et al.

Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.

LGMay 2, 2024
SynFlowNet: Design of Diverse and Novel Molecules with Synthesis Constraints

Miruna Cretu, Charles Harris, Ilia Igashov et al.

Generative models see increasing use in computer-aided drug design. However, while performing well at capturing distributions of molecular motifs, they often produce synthetically inaccessible molecules. To address this, we introduce SynFlowNet, a GFlowNet model whose action space uses chemical reactions and purchasable reactants to sequentially build new molecules. By incorporating forward synthesis as an explicit constraint of the generative mechanism, we aim at bridging the gap between in silico molecular generation and real world synthesis capabilities. We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool to assess the synthesizability of our compounds, and motivate the choice of GFlowNets through considerable improvement in sample diversity compared to baselines. Additionally, we identify challenges with reaction encodings that can complicate traversal of the MDP in the backward direction. To address this, we introduce various strategies for learning the GFlowNet backward policy and thus demonstrate how additional constraints can be integrated into the GFlowNet MDP framework. This approach enables our model to successfully identify synthesis pathways for previously unseen molecules.

AIFeb 11, 2025
Nature Language Model: Deciphering the Language of Nature for Scientific Discovery

Yingce Xia, Peiran Jin, Shufang Xie et al. · microsoft-research

Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.

BMMay 2, 2024
Generative Active Learning for the Search of Small-molecule Protein Binders

Maksym Korablyov, Cheng-Hao Liu, Moksh Jain et al. · mila

Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules which exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecules to discover candidates with a desired property. We apply LambdaZero with molecular docking to design novel small molecules that inhibit the enzyme soluble Epoxide Hydrolase 2 (sEH), while enforcing constraints on synthesizability and drug-likeliness. LambdaZero provides an exponential speedup in terms of the number of calls to the expensive molecular docking oracle, and LambdaZero de novo designed molecules reach docking scores that would otherwise require the virtual screening of a hundred billion molecules. Importantly, LambdaZero discovers novel scaffolds of synthesizable, drug-like inhibitors for sEH. In in vitro experimental validation, a series of ligands from a generated quinazoline-based scaffold were synthesized, and the lead inhibitor N-(4,6-di(pyrrolidin-1-yl)quinazolin-2-yl)-N-methylbenzamide (UM0152893) displayed sub-micromolar enzyme inhibition of sEH.

LGMar 9, 2025
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials

Gongbo Zhang, Yanting Li, Renqian Luo et al. · microsoft-research

Function in natural systems arises from one-dimensional sequences forming three-dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under-modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co-generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder-only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task-specific tokens. Besides the new high SOTAs on structure prediction tasks, the model demonstrates state-of-the-art or competitive performance for the function-aware generation across domains: in materials, it achieves "conflicted" multi-property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 Å) by over 23-fold and enhances EC-conditioned enzyme design. Ablation studies and cross-domain transfer substantiate the benefits of joint discrete-continuous training, establishing UniGenX as a significant advance from prediction to controllable, function-aware generation.

LGDec 6, 2024
Chemist-aligned retrosynthesis by ensembling diverse inductive bias models

Krzysztof Maziarz, Guoqing Liu, Hubert Misztela et al.

Chemical synthesis remains a critical bottleneck in the discovery and manufacture of functional small molecules. AI-based synthesis planning models could be a potential remedy to find effective syntheses, and have made progress in recent years. However, they still struggle with less frequent, yet critical reactions for synthetic strategy, as well as hallucinated, incorrect predictions. This hampers multi-step search algorithms that rely on models, and leads to misalignment with chemists' expectations. Here we propose RetroChimera: a frontier retrosynthesis model, built upon two newly developed components with complementary inductive biases, which we fuse together using a new framework for integrating predictions from multiple sources via a learning-based ensembling strategy. Through experiments across several orders of magnitude in data scale and splitting strategy, we show RetroChimera outperforms all major models by a large margin, demonstrating robustness outside the training data, as well as for the first time the ability to learn from even a very small number of examples per reaction class. Moreover, industrial organic chemists prefer predictions from RetroChimera over the reactions it was trained on in terms of quality, revealing high levels of alignment. Finally, we demonstrate zero-shot transfer to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our ensembling framework unlocks, we anticipate further acceleration in the development of even more accurate models.

LGJun 26, 2024
RetroGFN: Diverse and Feasible Retrosynthesis using GFlowNets

Piotr Gaiński, Michał Koziarski, Krzysztof Maziarz et al.

Single-step retrosynthesis aims to predict a set of reactions that lead to the creation of a target molecule, which is a crucial task in molecular discovery. Although a target molecule can often be synthesized with multiple different reactions, it is not clear how to verify the feasibility of a reaction, because the available datasets cover only a tiny fraction of the possible solutions. Consequently, the existing models are not encouraged to explore the space of possible reactions sufficiently. In this paper, we propose a novel single-step retrosynthesis model, RetroGFN, that can explore outside the limited dataset and return a diverse set of feasible reactions by leveraging a feasibility proxy model during the training. We show that RetroGFN achieves competitive results on standard top-k accuracy while outperforming existing methods on round-trip accuracy. Moreover, we provide empirical arguments in favor of using round-trip accuracy, which expands the notion of feasibility with respect to the standard top-k accuracy metric.

LGMay 4, 2023
Are VAEs Bad at Reconstructing Molecular Graphs?

Hagen Muenkler, Hubert Misztela, Michal Pikusa et al.

Many contemporary generative models of molecules are variational auto-encoders of molecular graphs. One term in their training loss pertains to reconstructing the input, yet reconstruction capabilities of state-of-the-art models have not yet been thoroughly compared on a large and chemically diverse dataset. In this work, we show that when several state-of-the-art generative models are evaluated under the same conditions, their reconstruction accuracy is surprisingly low, worse than what was previously reported on seemingly harder datasets. However, we show that improving reconstruction does not directly lead to better sampling or optimization performance. Failed reconstructions from the MoLeR model are usually similar to the inputs, assembling the same motifs in a different way, and possess similar chemical properties such as solubility. Finally, we show that the input molecule and its failed reconstruction are usually mapped by the different encoders to statistically distinguishable posterior distributions, hinting that posterior collapse may not fully explain why VAEs are bad at reconstructing molecular graphs.

LGApr 7, 2021
Modern Hopfield Networks for Few- and Zero-Shot Reaction Template Prediction

Philipp Seidl, Philipp Renz, Natalia Dyubankova et al.

Finding synthesis routes for molecules of interest is an essential step in the discovery of new drugs and materials. To find such routes, computer-assisted synthesis planning (CASP) methods are employed which rely on a model of chemical reactivity. In this study, we model single-step retrosynthesis in a template-based approach using modern Hopfield networks (MHNs). We adapt MHNs to associate different modalities, reaction templates and molecules, which allows the model to leverage structural information about reaction templates. This approach significantly improves the performance of template relevance prediction, especially for templates with few or zero training examples. With inference speed several times faster than that of baseline methods, we improve predictive performance for top-k exact match accuracy for $\mathrm{k}\geq5$ in the retrosynthesis benchmark USPTO-50k.

LGMar 5, 2021
Learning to Extend Molecular Scaffolds with Structural Motifs

Krzysztof Maziarz, Henry Jackson-Flux, Pashmina Cameron et al.

Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as initial seed of the generative procedure, which is possible because it is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks, and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on the overall performance.

LGNov 26, 2020
Molecular representation learning with language models and domain-relevant auxiliary tasks

Benedek Fabian, Thomas Edlich, Héléna Gaspar et al.

We apply a Transformer architecture, specifically BERT, to learn flexible and high quality molecular representations for drug discovery problems. We study the impact of using different combinations of self-supervised tasks for pre-training, and present our results for the established Virtual Screening and QSAR benchmarks. We show that: i) The selection of appropriate self-supervised task(s) for pre-training has a significant impact on performance in subsequent downstream tasks such as Virtual Screening. ii) Using auxiliary tasks with more domain relevance for Chemistry, such as learning to predict calculated molecular properties, increases the fidelity of our learnt representations. iii) Finally, we show that molecular representations learnt by our model `MolBert' improve upon the current state of the art on the benchmark datasets.

AIJan 31, 2017
Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies

Marwin Segler, Mike Preuß, Mark P. Waller

Retrosynthesis is a technique to plan the chemical synthesis of organic molecules, for example drugs, agro- and fine chemicals. In retrosynthesis, a search tree is built by analysing molecules recursively and dissecting them into simpler molecular building blocks until one obtains a set of known building blocks. The search space is intractably large, and it is difficult to determine the value of retrosynthetic positions. Here, we propose to model retrosynthesis as a Markov Decision Process. In combination with a Deep Neural Network policy learned from essentially the complete published knowledge of chemistry, Monte Carlo Tree Search (MCTS) can be used to evaluate positions. In exploratory studies, we demonstrate that MCTS with neural network policies outperforms the traditionally used best-first search with hand-coded heuristics.