Edward O. Pyzer-Knapp

ML
h-index23
13papers
350citations
Novelty50%
AI Score31

13 Papers

QMMay 3, 2022
Self-focusing virtual screening with active design space pruning

David E. Graff, Matteo Aldeghi, Joseph A. Morrone et al.

High-throughput virtual screening is an indispensable technique utilized in the discovery of small molecules. In cases where the library of molecules is exceedingly large, the cost of an exhaustive virtual screen may be prohibitive. Model-guided optimization has been employed to lower these costs through dramatic increases in sample efficiency compared to random selection. However, these techniques introduce new costs to the workflow through the surrogate model training and inference steps. In this study, we propose an extension to the framework of model-guided optimization that mitigates inferences costs using a technique we refer to as design space pruning (DSP), which irreversibly removes poor-performing candidates from consideration. We study the application of DSP to a variety of optimization tasks and observe significant reductions in overhead costs while exhibiting similar performance to the baseline optimization. DSP represents an attractive extension of model-guided optimization that can limit overhead costs in optimization settings where these costs are non-negligible relative to objective costs, such as docking.

CHEM-PHAug 15, 2024
Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents

Nicholas J. Williams, Lara Kabalan, Ljiljana Stojanovic et al.

A significant challenge in computational chemistry is developing approximations that accelerate \emph{ab initio} methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising solution for constructing atomistic potentials that can be transferred across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces in vacuum, while an improved description of the potential energy surface could be achieved by including the curvature of the potential energy surface. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $ω$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as water, tetrahydrofuran, and toluene using an implicit solvation model. To demonstrate the utility of this dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of a MLIP significantly improves the prediction of vibrational frequencies in all solvent environments, thus making this dataset extremely useful for studying organic molecules in realistic solvent environments for experimental characterization.

MLAug 11, 2022
A Principled Method for the Creation of Synthetic Multi-fidelity Data Sets

Clyde Fare, Peter Fenner, Edward O. Pyzer-Knapp

Multifidelity and multioutput optimisation algorithms are of active interest in many areas of computational design as they allow cheaper computational proxies to be used intelligently to aid experimental searches for high-performing species. Characterisation of these algorithms involves benchmarks that typically either use analytic functions or existing multifidelity datasets. However, analytic functions are often not representative of relevant problems, while preexisting datasets do not allow systematic investigation of the influence of characteristics of the lower fidelity proxies. To bridge this gap, we present a methodology for systematic generation of synthetic fidelities derived from preexisting datasets. This allows for the construction of benchmarks that are both representative of practical optimisation problems while also allowing systematic investigation of the influence of the lower fidelity proxies.

LGFeb 19, 2025
Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models

Matthew P. Wilson, Edward O. Pyzer-Knapp, Nicolas Galichet et al.

Pretrained foundation models learn embeddings that can be used for a wide range of downstream tasks. These embeddings optimise general performance, and if insufficiently accurate at a specific task the model can be fine-tuned to improve performance. For all current methodologies this operation necessarily degrades performance on all out-of-distribution tasks. In this work we present 'fill-tuning', a novel methodology to generate datasets for continued pretraining of foundation models that are not suited to a particular downstream task, but instead aim to correct poor regions of the embedding. We present the application of roughness analysis to latent space topologies and illustrate how it can be used to propose data that will be most valuable to improving the embedding. We apply fill-tuning to a set of state-of-the-art materials foundation models trained on $O(10^9)$ data points and show model improvement of almost 1% in all downstream tasks with the addition of only 100 data points. This method provides a route to the general improvement of foundation models at the computational cost of fine-tuning.

LGMay 18, 2023
Physics Inspired Approaches To Understanding Gaussian Processes

Maximilian P. Niroomand, Luke Dicks, Edward O. Pyzer-Knapp et al.

Prior beliefs about the latent function to shape inductive biases can be incorporated into a Gaussian Process (GP) via the kernel. However, beyond kernel choices, the decision-making process of GP models remains poorly understood. In this work, we contribute an analysis of the loss landscape for GP models using methods from physics. We demonstrate $ν$-continuity for Matern kernels and outline aspects of catastrophe theory at critical points in the loss landscape. By directly including $ν$ in the hyperparameter optimisation for Matern kernels, we find that typical values of $ν$ are far from optimal in terms of performance, yet prevail in the literature due to the increased computational speed. We also provide an a priori method for evaluating the effect of GP ensembles and discuss various voting approaches based on physical properties of the loss landscape. The utility of these approaches is demonstrated for various synthetic and real datasets. Our findings provide an enhanced understanding of the decision-making process behind GPs and offer practical guidance for improving their performance and interpretability in a range of applications.

BMMay 11, 2020
Using Bayesian Optimization to Accelerate Virtual Screening for the Discovery of Therapeutics Appropriate for Repurposing for COVID-19

Edward O. Pyzer-Knapp

The novel Wuhan coronavirus known as SARS-CoV-2 has brought almost unprecedented effects for a non-wartime setting, hitting social, economic and health systems hard.~ Being able to bring to bear pharmaceutical interventions to counteract its effects will represent a major turning point in the fight to turn the tides in this ongoing battle.~ Recently, the World's most powerful supercomputer, SUMMIT, was used to identify existing small molecule pharmaceuticals which may have the desired activity against SARS-CoV-2 through a high throughput virtual screening approach. In this communication, we demonstrate how the use of Bayesian optimization can provide a valuable service for the prioritisation of these calculations, leading to the accelerated identification of high-performing candidates, and thus expanding the scope of the utility of HPC systems for time critical screening

CRJan 28, 2020
Privacy-Preserving Gaussian Process Regression -- A Modular Approach to the Application of Homomorphic Encryption

Peter Fenner, Edward O. Pyzer-Knapp

Much of machine learning relies on the use of large amounts of data to train models to make predictions. When this data comes from multiple sources, for example when evaluation of data against a machine learning model is offered as a service, there can be privacy issues and legal concerns over the sharing of data. Fully homomorphic encryption (FHE) allows data to be computed on whilst encrypted, which can provide a solution to the problem of data privacy. However, FHE is both slow and restrictive, so existing algorithms must be manipulated to make them work efficiently under the FHE paradigm. Some commonly used machine learning algorithms, such as Gaussian process regression, are poorly suited to FHE and cannot be manipulated to work both efficiently and accurately. In this paper, we show that a modular approach, which applies FHE to only the sensitive steps of a workflow that need protection, allows one party to make predictions on their data using a Gaussian process regression model built from another party's data, without either party gaining access to the other's data, in a way which is both accurate and efficient. This construction is, to our knowledge, the first example of an effectively encrypted Gaussian process.

LGNov 8, 2019
Fully Bayesian Recurrent Neural Networks for Safe Reinforcement Learning

Matt Benatan, Edward O. Pyzer-Knapp

Reinforcement Learning (RL) has demonstrated state-of-the-art results in a number of autonomous system applications, however many of the underlying algorithms rely on black-box predictions. This results in poor explainability of the behaviour of these systems, raising concerns as to their use in safety-critical applications. Recent work has demonstrated that uncertainty-aware models exhibit more cautious behaviours through the incorporation of model uncertainty estimates. In this work, we build on Probabilistic Backpropagation to introduce a fully Bayesian Recurrent Neural Network architecture. We apply this within a Safe RL scenario, and demonstrate that the proposed method significantly outperforms a popular approach for obtaining model uncertainties in collision avoidance tasks. Furthermore, we demonstrate that the proposed approach requires less training and is far more efficient than the current leading method, both in terms of compute resource and memory footprint.

CHEM-PHSep 17, 2018
Powerful, transferable representations for molecules through intelligent task selection in deep multitask networks

Clyde Fare, Lukas Turcani, Edward O. Pyzer-Knapp

Chemical representations derived from deep learning are emerging as a powerful tool in areas such as drug discovery and materials innovation. Currently, this methodology has three major limitations - the cost of representation generation, risk of inherited bias, and the requirement for large amounts of data. We propose the use of multi-task learning in tandem with transfer learning to address these limitations directly. In order to avoid introducing unknown bias into multi-task learning through the task selection itself, we calculate task similarity through pairwise task affinity, and use this measure to programmatically select tasks. We test this methodology on several real-world data sets to demonstrate its potential for execution in complex and low-data environments. Finally, we utilise the task similarity to further probe the expressiveness of the learned representation through a comparison to a commonly used cheminformatics fingerprint, and show that the deep representation is able to capture more expressive task-based information.

MLJul 3, 2018
Dynamic Control of Explore/Exploit Trade-Off In Bayesian Optimization

Dipti Jasrasaria, Edward O. Pyzer-Knapp

Bayesian optimization offers the possibility of optimizing black-box operations not accessible through traditional techniques. The success of Bayesian optimization methods such as Expected Improvement (EI) are significantly affected by the degree of trade-off between exploration and exploitation. Too much exploration can lead to inefficient optimization protocols, whilst too much exploitation leaves the protocol open to strong initial biases, and a high chance of getting stuck in a local minimum. Typically, a constant margin is used to control this trade-off, which results in yet another hyper-parameter to be optimized. We propose contextual improvement as a simple, yet effective heuristic to counter this - achieving a one-shot optimization strategy. Our proposed heuristic can be swiftly calculated and improves both the speed and robustness of discovery of optimal solutions. We demonstrate its effectiveness on both synthetic and real world problems and explore the unaccounted for uncertainty in the pre-determination of search hyperparameters controlling explore-exploit trade-off.

MLJun 4, 2018
Efficient and Scalable Batch Bayesian Optimization Using K-Means

Matthew Groves, Edward O. Pyzer-Knapp

We present K-Means Batch Bayesian Optimization (KMBBO), a novel batch sampling algorithm for Bayesian Optimization (BO). KMBBO uses unsupervised learning to efficiently estimate peaks of the model acquisition function. We show in empirical experiments that our method outperforms the current state-of-the-art batch allocation algorithms on a variety of test problems including tuning of algorithm hyper-parameters and a challenging drug discovery problem. In order to accommodate the real-world problem of high dimensional data, we propose a modification to KMBBO by combining it with compressed sensing to project the optimization into a lower dimensional subspace. We demonstrate empirically that this 2-step method outperforms algorithms where no dimensionality reduction has taken place.

MLJun 6, 2017
Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space

José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp et al.

Chemical space is so large that brute force searches for new interesting molecules are infeasible. High-throughput virtual screening via computer cluster simulations can speed up the discovery process by collecting very large amounts of data in parallel, e.g., up to hundreds or thousands of parallel measurements. Bayesian optimization (BO) can produce additional acceleration by sequentially identifying the most useful simulations or experiments to be performed next. However, current BO methods cannot scale to the large numbers of parallel measurements and the massive libraries of molecules currently used in high-throughput screening. Here, we propose a scalable solution based on a parallel and distributed implementation of Thompson sampling (PDTS). We show that, in small scale problems, PDTS performs similarly as parallel expected improvement (EI), a batch version of the most widely used BO heuristic. Additionally, in settings where parallel EI does not scale, PDTS outperforms other scalable baselines such as a greedy search, $ε$-greedy approaches and a random search method. These results show that PDTS is a successful solution for large-scale parallel BO.

MLAug 19, 2016
Space-Filling Curves as a Novel Crystal Structure Representation for Machine Learning Models

Dipti Jasrasaria, Edward O. Pyzer-Knapp, Dmitrij Rappoport et al.

A fundamental problem in applying machine learning techniques for chemical problems is to find suitable representations for molecular and crystal structures. While the structure representations based on atom connectivities are prevalent for molecules, two-dimensional descriptors are not suitable for describing molecular crystals. In this work, we introduce the SFC-M family of feature representations, which are based on Morton space-filling curves, as an alternative means of representing crystal structures. Latent Semantic Indexing (LSI) was employed in a novel setting to reduce sparsity of feature representations. The quality of the SFC-M representations were assessed by using them in combination with artificial neural networks to predict Density Functional Theory (DFT) single point, Ewald summed, lattice, and many-body dispersion energies of 839 organic molecular crystal unit cells from the Cambridge Structural Database that consist of the elements C, H, N, and O. Promising initial results suggest that the SFC-M representations merit further exploration to improve its ability to predict solid-state properties of organic crystal structures