LG | Mar 2, 2017
MoleculeNet: A Benchmark for Molecular Machine Learning
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg et al.
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited by the lack of a standard benchmark for comparing the efficacy of proposed methods; most new algorithms are benchmarked on different datasets, making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large-scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high-quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open-source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle with complex tasks under data scarcity and with highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than the choice of a particular learning algorithm.
CHEM-PH | Jan 10, 2025
Efficient Transition State Searches by Freezing String Method with Graph Neural Network Potentials
Jonah Marks, Joseph Gomes
Transition state (TS) searches are a critical bottleneck in computational studies of chemical reactivity, as accurately capturing complex phenomena like bond breaking and formation events requires repeated evaluations of expensive ab-initio potential energy surfaces (PESs). While numerous algorithms have been developed to locate TSs efficiently, the computational cost of PES evaluations remains a key limitation. In this work, we develop and fine-tune a graph neural network (GNN) PES to accelerate TS searches for organic reactions. Our GNN of choice, SchNet, is first pre-trained on the ANI-1 dataset and subsequently fine-tuned on a small dataset of reactant, product, and TS structures. We integrate this GNN PES into the Freezing String Method (FSM), enabling rapid generation of TS guess geometries. Across a benchmark suite of chemically diverse reactions, our fine-tuned model (GNN-FT) achieves a 100% success rate, locating the reference TSs in all cases while reducing the number of ab-initio calculations by 72% on average compared to conventional DFT-based FSM searches. Fine-tuning reduces GNN-FT errors by orders of magnitude for out-of-distribution cases such as non-covalent interactions, and improves TS-region predictions with comparatively little data. Analysis of transition state geometries and energy errors shows that GNN-FT captures PES along the reaction coordinate with sufficient accuracy to serve as a reliable DFT surrogate. These results demonstrate that modern GNN potentials, when properly trained, can significantly reduce the cost of TS searches and broaden the scope and size of systems considered in chemical reactivity studies.
LG | Jul 15, 2020
Evaluating and Validating Cluster Results
Anupriya Vysala, Joseph Gomes
Clustering is a technique for partitioning data according to their characteristics. Data that are similar in nature belong to the same cluster [1]. There are two types of methods for evaluating clustering quality. One is external evaluation, where the true labels in the data sets are known in advance, and the other is internal evaluation, in which the evaluation is done with the data set itself without true labels. In this paper, both external evaluation and internal evaluation are performed on the cluster results of the IRIS dataset. In the case of external evaluation, Homogeneity, Correctness and V-measure scores are calculated for the dataset. For internal performance measures, the Silhouette Index and Sum of Square Errors are used. These internal performance measures, along with the dendrogram (a graphical tool from hierarchical clustering), are used first to validate the number of clusters. Finally, as a statistical tool, we used the frequency distribution method to compare and provide a visual representation of the distribution of observations within a clustering result and the original data.
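As a rough illustration of the external scores named in this abstract, the sketch below computes homogeneity, completeness, and V-measure from two label lists using only the Python standard library. It follows the usual entropy-based definitions, in which V-measure is the harmonic mean of homogeneity and completeness. Treating the abstract's second score as the standard completeness score is an assumption, and the function names are illustrative rather than taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """Conditional entropy H(targets | given) in nats."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Return (homogeneity, completeness, v_measure) for a clustering.

    Assumption: the standard entropy-based definitions are used; the
    paper's exact scoring code is not shown in the abstract.
    """
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    # Completeness: all members of a class land in the same cluster.
    completeness = (
        1.0 if h_cluster == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_cluster
    )
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v
```

Because the scores depend only on how labels co-occur, a clustering that matches the true classes up to a renaming of cluster ids scores 1.0 on all three, while placing every point in one cluster gives perfect completeness but zero homogeneity.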
LG | May 29, 2019
Strategies for Pre-training Graph Neural Networks
Weihua Hu, Bowen Liu, Joseph Gomes et al.
Many applications of machine learning require a model to make accurate predictions on test examples that are distributionally different from training ones, while task-specific labels are scarce during training. An effective approach to this challenge is to pre-train a model on related tasks where data is abundant, and then fine-tune it on a downstream task of interest. While pre-training has been effective in many language and vision domains, it remains an open question how to effectively use pre-training on graph datasets. In this paper, we develop a new strategy and self-supervised methods for pre-training Graph Neural Networks (GNNs). The key to the success of our strategy is to pre-train an expressive GNN at the level of individual nodes as well as entire graphs so that the GNN can learn useful local and global representations simultaneously. We systematically study pre-training on multiple graph classification datasets. We find that naive strategies, which pre-train GNNs at the level of either entire graphs or individual nodes, give limited improvement and can even lead to negative transfer on many downstream tasks. In contrast, our strategy avoids negative transfer and improves generalization significantly across downstream tasks, leading up to 9.4% absolute improvements in ROC-AUC over non-pre-trained models and achieving state-of-the-art performance for molecular property prediction and protein function prediction.
LG | Mar 23, 2018
Deep Learning Phase Segregation
Amir Barati Farimani, Joseph Gomes, Rishi Sharma et al.
Phase segregation, the process by which the components of a binary mixture spontaneously separate, is a key process in the evolution and design of many chemical, mechanical, and biological systems. In this work, we present a data-driven approach for the learning, modeling, and prediction of phase segregation. A direct mapping between an initially dispersed, immiscible binary fluid and the equilibrium concentration field is learned by conditional generative convolutional neural networks. Concentration field predictions by the deep learning model conserve phase fraction, correctly predict phase transition, and reproduce area, perimeter, and total free energy distributions up to 98% accuracy.
LG | Sep 7, 2017
Deep Learning the Physics of Transport Phenomena
Amir Barati Farimani, Joseph Gomes, Vijay S. Pande
We have developed a new data-driven paradigm for the rapid inference, modeling and simulation of the physics of transport phenomena by deep learning. Using conditional generative adversarial networks (cGAN), we train models for the direct generation of solutions to steady state heat conduction and incompressible fluid flow purely on observation without knowledge of the underlying governing equations. Rather than using iterative numerical methods to approximate the solution of the constitutive equations, cGANs learn to directly generate the solutions to these phenomena, given arbitrary boundary conditions and domain, with high test accuracy (MAE < 1%) and state-of-the-art computational performance. The cGAN framework can be used to learn causal models directly from experimental observations where the underlying physical model is complex or unknown.
LG | Jun 6, 2017
Retrosynthetic reaction prediction using neural sequence-to-sequence models
Bowen Liu, Bharath Ramsundar, Prasad Kawthekar et al.
We describe a fully data-driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture consisting of two recurrent neural networks, an architecture that has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step towards solving the challenging problem of computational retrosynthetic analysis.
LG | Mar 30, 2017
Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity
Joseph Gomes, Bharath Ramsundar, Evan N. Feinberg et al.
Empirical scoring functions based on either molecular force fields or cheminformatics descriptors are widely used, in conjunction with molecular docking, during the early stages of drug discovery to predict potency and binding affinity of a drug-like molecule to a given target. These models require expert-level knowledge of physical chemistry and biology to be encoded as hand-tuned parameters or features rather than allowing the underlying model to select features in a data-driven procedure. Here, we develop a general 3-dimensional spatial convolution operation for learning atomic-level chemical interactions directly from atomic coordinates and demonstrate its application to structure-based bioactivity prediction. The atomic convolutional neural network is trained to predict the experimentally determined binding affinity of a protein-ligand complex by direct calculation of the energy associated with the complex, protein, and ligand given the crystal structure of the binding pose. Non-covalent interactions present in the complex that are absent in the protein-ligand sub-structures are identified and the model learns the interaction strength associated with these features. We test our model by predicting the binding free energy of a subset of protein-ligand complexes found in the PDBBind dataset and compare with state-of-the-art cheminformatics and machine learning-based approaches. We find that all methods achieve experimental accuracy and that atomic convolutional networks either outperform or perform competitively with the cheminformatics based methods. Unlike all previous protein-ligand prediction systems, atomic convolutional networks are end-to-end and fully-differentiable. They represent a new data-driven, physics-based deep learning model paradigm that offers a strong foundation for future improvements in structure-based bioactivity prediction.
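The abstract describes predicting binding affinity from the energies of the complex, the protein, and the ligand. A minimal sketch of that bookkeeping follows, assuming the prediction takes the form E(complex) - E(protein) - E(ligand). The pair potential used here is a hypothetical toy function standing in for the learned atomic convolution model, and all function names are illustrative. For any pairwise-additive energy, the difference reduces to the protein-ligand cross terms, which mirrors the abstract's remark about interactions present in the complex but absent in the separate sub-structures.

```python
from itertools import combinations
from math import dist, exp


def pair_energy(a, b):
    """Hypothetical toy pair potential (NOT the paper's learned model):
    an attractive Gaussian well in the distance between two atoms."""
    return -exp(-dist(a, b) ** 2)


def total_energy(atoms):
    """Sum the toy pair potential over all unordered atom pairs."""
    return sum(pair_energy(a, b) for a, b in combinations(atoms, 2))


def binding_energy(protein, ligand):
    """Assumed form: E(complex) - E(protein) - E(ligand).

    `protein` and `ligand` are lists of (x, y, z) coordinate tuples, and
    the complex is simply their concatenation in the bound pose.
    """
    complex_atoms = protein + ligand
    return (
        total_energy(complex_atoms)
        - total_energy(protein)
        - total_energy(ligand)
    )


def cross_energy(protein, ligand):
    """Energy of protein-ligand pairs only, i.e. the interactions that
    exist in the complex but in neither isolated sub-structure."""
    return sum(pair_energy(p, l) for p in protein for l in ligand)
```

With this toy potential, `binding_energy(protein, ligand)` and `cross_energy(protein, ligand)` agree up to floating-point rounding, because the intra-protein and intra-ligand pairs cancel in the subtraction. A learned many-body model would not cancel this cleanly, so the identity is a property of the toy potential only.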