LGOct 17, 2022
Industry-Scale Orchestrated Federated Learning for Drug DiscoveryMartijn Oldenhof, Gergely Ács, Balázs Pejó et al.
To apply federated learning to drug discovery we developed a novel platform in the context of European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which was comprised of 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
MLMar 9, 2022
SparseChem: Fast and accurate machine learning model for small moleculesAdam Arany, Jaak Simm, Martijn Oldenhof et al.
SparseChem provides fast and accurate machine learning models for biochemical applications. Especially, the package supports very high-dimensional sparse inputs, e.g., millions of features and millions of compounds. It is possible to train classification, regression and censored regression models, or combination of them from command line. Additionally, the library can be accessed directly from Python. Source code and documentation is freely available under MIT License on GitHub.
CVMar 9, 2023
Weakly Supervised Knowledge Transfer with Probabilistic Logical Reasoning for Object DetectionMartijn Oldenhof, Adam Arany, Yves Moreau et al.
Training object detection models usually requires instance-level annotations, such as the positions and labels of all objects present in each image. Such supervision is unfortunately not always available and, more often, only image-level information is provided, also known as weak supervision. Recent works have addressed this limitation by leveraging knowledge from a richly annotated domain. However, the scope of weak supervision supported by these approaches has been very restrictive, preventing them to use all available information. In this work, we propose ProbKT, a framework based on probabilistic logical reasoning that allows to train object detection models with arbitrary types of weak supervision. We empirically show on different datasets that using all available information is beneficial as our ProbKT leads to significant improvement on target domain and better generalization compared to existing baselines. We also showcase the ability of our approach to handle complex logic statements as supervision signal.
LGApr 5, 2023
How good Neural Networks interpretation methods really are? A quantitative benchmarkAntoine Passemiers, Pietro Folco, Daniele Raimondi et al.
Saliency Maps (SMs) have been extensively used to interpret deep learning models decision by highlighting the features deemed relevant by the model. They are used on highly nonlinear problems, where linear feature selection (FS) methods fail at highlighting relevant explanatory variables. However, the reliability of gradient-based feature attribution methods such as SM has mostly been only qualitatively (visually) assessed, and quantitative benchmarks are currently missing, partially due to the lack of a definite ground truth on image data. Concerned about the apophenic biases introduced by visual assessment of these methods, in this paper we propose a synthetic quantitative benchmark for Neural Networks (NNs) interpretation methods. For this purpose, we built synthetic datasets with nonlinearly separable classes and increasing number of decoy (random) features, illustrating the challenge of FS in high-dimensional settings. We also compare these methods to conventional approaches such as mRMR or Random Forests. Our results show that our simple synthetic datasets are sufficient to challenge most of the benchmarked methods. TreeShap, mRMR and LassoNet are the best performing FS methods. We also show that, when quantifying the relevance of a few non linearly-entangled predictive features diluted in a large number of irrelevant noisy variables, neural network-based FS and interpretation methods are still far from being reliable.
LGMay 1
A Comparative Study of QSPR Methods on a Unique Multitask PAMPA datasetAndrs Formanek, Anna Vincze, Richrd Bicsak et al.
We present a unique, multitask dataset comprising 143 drug and drug candidate molecules, each evaluated on in vitro, parallel artificial-membrane permeability assays (PAMPA) using six different model membranes. Using this resource, we systematically assess the effectiveness of various molecular descriptors and regression models in predicting passive membrane permeability. The studied models range from simple linear regression to a modern pre-trained transformer architecture. Particular attention is given to the trade-off between predictive performance and model interpretability, highlighting the challenges introduced by machine learning approaches. To our knowledge, this is the most comprehensive study on simultaneous modeling of multiple organ-specific PAMPA membranes to date, offering novel insights into membrane-specific permeability profiles. We found that expert-designed physico-chemical property descriptors are more fitting for a limited sample size permeabilty study than deep learning based representations.
LGJul 19, 2024
Achieving Well-Informed Decision-Making in Drug Discovery: A Comprehensive Calibration Study using Neural Network-Based Structure-Activity ModelsHannah Rosa Friesacher, Ola Engkvist, Lewis Mervin et al.
In the drug discovery process, where experiments can be costly and time-consuming, computational models that predict drug-target interactions are valuable tools to accelerate the development of new therapeutic agents. Estimating the uncertainty inherent in these neural network predictions provides valuable information that facilitates optimal decision-making when risk assessment is crucial. However, such models can be poorly calibrated, which results in unreliable uncertainty estimates that do not reflect the true predictive uncertainty. In this study, we compare different metrics, including accuracy and calibration scores, used for model hyperparameter tuning to investigate which model selection strategy achieves well-calibrated models. Furthermore, we propose to use a computationally efficient Bayesian uncertainty estimation method named Bayesian Linear Probing (BLP), which generates Hamiltonian Monte Carlo (HMC) trajectories to obtain samples for the parameters of a Bayesian Logistic Regression fitted to the hidden layer of the baseline neural network. We report that BLP improves model calibration and achieves the performance of common uncertainty quantification methods by combining the benefits of uncertainty estimation and probability calibration methods. Finally, we show that combining post hoc calibration method with well-performing uncertainty quantification approaches can boost model accuracy and calibration.
LGApr 4, 2019Code
SMURFF: a High-Performance Framework for Matrix FactorizationTom Vander Aa, Imen Chakroun, Thomas J. Ashby et al.
Bayesian Matrix Factorization (BMF) is a powerful technique for recommender systems because it produces good results and is relatively robust against overfitting. Yet BMF is more computationally intensive and thus more challenging to implement for large datasets. In this work we present SMURFF a high-performance feature-rich framework to compose and construct different Bayesian matrix-factorization methods. The framework has been successfully used in to do large scale runs of compound-activity prediction. SMURFF is available as open-source and can be used both on a supercomputer and on a desktop or laptop machine. Documentation and several examples are provided as Jupyter notebooks using SMURFF's high-level Python API.
CVApr 2, 2024
Atom-Level Optical Chemical Structure Recognition with Limited SupervisionMartijn Oldenhof, Edward De Brouwer, Adam Arany et al.
Identifying the chemical structure from a graphical representation, or image, of a molecule is a challenging pattern recognition task that would greatly benefit drug development. Yet, existing methods for chemical structure recognition do not typically generalize well, and show diminished effectiveness when confronted with domains where data is sparse, or costly to generate, such as hand-drawn molecule images. To address this limitation, we propose a new chemical structure recognition tool that delivers state-of-the-art performance and can adapt to new domains with a limited number of data samples and supervision. Unlike previous approaches, our method provides atom-level localization, and can therefore segment the image into the different atoms and bonds. Our model is the first model to perform OCSR with atom-level entity detection with only SMILES supervision. Through rigorous and extensive benchmarking, we demonstrate the preeminence of our chemical structure recognition approach in terms of data efficiency, accuracy, and atom-level entity prediction.
LGDec 2, 2024
Federated Block-Term Tensor Regression for decentralised data analysis in healthcareAxel Faes, Ashkan Pirmani, Yves Moreau et al.
Block-Term Tensor Regression (BTTR) has proven to be a powerful tool for modeling complex, high-dimensional data by leveraging multilinear relationships, making it particularly well-suited for applications in healthcare and neuroscience. However, traditional implementations of BTTR rely on centralized datasets, which pose significant privacy risks and hinder collaboration across institutions. To address these challenges, we introduce Federated Block-Term Tensor Regression (FBTTR), an extension of BTTR designed for federated learning scenarios. FBTTR enables decentralized data analysis, allowing institutions to collaboratively build predictive models while preserving data privacy and complying with regulations. FBTTR represents a major step forward in applying tensor regression to federated learning environments. Its performance is evaluated in two case studies: finger movement decoding from Electrocorticography (ECoG) signals and heart disease prediction. In the first case study, using the BCI Competition IV dataset, FBTTR outperforms non-multilinear models, demonstrating superior accuracy in decoding finger movements. For the dataset, for subject 3, the thumb obtained a performance of 0.76 $\pm$ .05 compared to 0.71 $\pm$ 0.05 for centralised BTTR. In the second case study, FBTTR is applied to predict heart disease using real-world clinical datasets, outperforming both standard federated learning approaches and centralized BTTR models. In the Fed-Heart-Disease Dataset, an AUC-ROC was obtained of 0.872 $\pm$ 0.02 and an accuracy of 0.772 $\pm$ 0.02 compared to 0.812 $\pm$ 0.003 and 0.753 $\pm$ 0.007 for the centralized model.
LGMar 25, 2021
Self-Labeling of Fully Mediating Representations by Graph AlignmentMartijn Oldenhof, Adam Arany, Yves Moreau et al.
To be able to predict a molecular graph structure ($W$) given a 2D image of a chemical compound ($U$) is a challenging problem in machine learning. We are interested to learn $f: U \rightarrow W$ where we have a fully mediating representation $V$ such that $f$ factors into $U \rightarrow V \rightarrow W$. However, observing V requires detailed and expensive labels. We propose graph aligning approach that generates rich or detailed labels given normal labels $W$. In this paper we investigate the scenario of domain adaptation from the source domain where we have access to the expensive labels $V$ to the target domain where only normal labels W are available. Focusing on the problem of predicting chemical compound graphs from 2D images the fully mediating layer is represented using the planar embedding of the chemical graph structure we are predicting. The use of a fully mediating layer implies some assumptions on the mechanism of the underlying process. However if the assumptions are correct it should allow the machine learning model to be more interpretable, generalize better and be more data efficient at training time. The empirical results show that, using only 4000 data points, we obtain up to 4x improvement of performance after domain adaptation to target domain compared to pretrained model only on the source domain. After domain adaptation, the model is even able to detect atom types that were never seen in the original source domain. Finally, on the Maybridge data set the proposed self-labeling approach reached higher performance than the current state of the art.
LGFeb 15, 2021
Topological Graph Neural NetworksMax Horn, Edward De Brouwer, Michael Moor et al.
Graph neural networks (GNNs) are a powerful architecture for tackling graph learning tasks, yet have been shown to be oblivious to eminent substructures such as cycles. We present TOGL, a novel layer that incorporates global topological information of a graph using persistent homology. TOGL can be easily integrated into any type of GNN and is strictly more expressive (in terms the Weisfeiler--Lehman graph isomorphism test) than message-passing GNNs. Augmenting GNNs with TOGL leads to improved predictive performance for graph and node classification tasks, both on synthetic data sets, which can be classified by humans using their topology but not by ordinary GNNs, and on real-world data.
LGNov 9, 2020
Longitudinal modeling of MS patient trajectories improves predictions of disability progressionEdward De Brouwer, Thijs Becker, Yves Moreau et al.
Research in Multiple Sclerosis (MS) has recently focused on extracting knowledge from real-world clinical data sources. This type of data is more abundant than data produced during clinical trials and potentially more informative about real-world clinical practice. However, this comes at the cost of less curated and controlled data sets. In this work, we address the task of optimally extracting information from longitudinal patient data in the real-world setting with a special focus on the sporadic sampling problem. Using the MSBase registry, we show that with machine learning methods suited for patient trajectories modeling, such as recurrent neural networks and tensor factorization, we can predict disability progression of patients in a two-year horizon with an ROC-AUC of 0.86, which represents a 33% decrease in the ranking pair error (1-AUC) compared to reference methods using static clinical features. Compared to the models available in the literature, this work uses the most complete patient history for MS disease progression prediction.
COSep 25, 2020
Multilevel Gibbs Sampling for Bayesian RegressionJoris Tavernier, Jaak Simm, Adam Arany et al.
Bayesian regression remains a simple but effective tool based on Bayesian inference techniques. For large-scale applications, with complicated posterior distributions, Markov Chain Monte Carlo methods are applied. To improve the well-known computational burden of Markov Chain Monte Carlo approach for Bayesian regression, we developed a multilevel Gibbs sampler for Bayesian regression of linear mixed models. The level hierarchy of data matrices is created by clustering the features and/or samples of data matrices. Additionally, the use of correlated samples is investigated for variance reduction to improve the convergence of the Markov Chain. Testing on a diverse set of data sets, speed-up is achieved for almost all of them without significant loss in predictive performance.
MLFeb 23, 2020
ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep LearningMartijn Oldenhof, Adam Arany, Yves Moreau et al.
In drug discovery, knowledge of the graph structure of chemical compounds is essential. Many thousands of scientific articles in chemistry and pharmaceutical sciences have investigated chemical compounds, but in cases the details of the structure of these chemical compounds is published only as an images. A tool to analyze these images automatically and convert them into a chemical graph structure would be useful for many applications, such drug discovery. A few such tools are available and they are mostly derived from optical character recognition. However, our evaluation of the performance of those tools reveals that they make often mistakes in detecting the correct bond multiplicity and stereochemical information. In addition, errors sometimes even lead to missing atoms in the resulting graph. In our work, we address these issues by developing a compound recognition method based on machine learning. More specifically, we develop a deep neural network model for optical compound recognition. The deep learning solution presented here consists of a segmentation model, followed by three classification models that predict atom locations, bonds and charges. Furthermore, this model not only predicts the graph structure of the molecule but also produces all information necessary to relate each component of the resulting graph to the source image. This solution is scalable and could rapidly process thousands of images. Finally, we compare empirically the proposed method to a well-established tool and observe significant error reductions.
MLJul 25, 2019
Expressive Graph Informer NetworksJaak Simm, Adam Arany, Edward De Brouwer et al.
Applying machine learning to molecules is challenging because of their natural representation as graphs rather than vectors.Several architectures have been recently proposed for deep learning from molecular graphs, but they suffer from informationbottlenecks because they only pass information from a graph node to its direct neighbors. Here, we introduce a more expressiveroute-based multi-attention mechanism that incorporates features from routes between node pairs. We call the resulting methodGraph Informer. A single network layer can therefore attend to nodes several steps away. We show empirically that the proposedmethod compares favorably against existing approaches in two prediction tasks: (1) 13C Nuclear Magnetic Resonance (NMR)spectra, improving the state-of-the-art with an MAE of 1.35 ppm and (2) predicting drug bioactivity and toxicity. Additionally, wedevelop a variant called injective Graph Informer that isprovablyas powerful as the Weisfeiler-Lehman test for graph isomorphism.Furthermore, we demonstrate that the route information allows the method to be informed about thenonlocal topologyof the graphand, thus, even go beyond the capabilities of the Weisfeiler-Lehman test.
LGMay 29, 2019
GRU-ODE-Bayes: Continuous modeling of sporadically-observed time seriesEdward De Brouwer, Jaak Simm, Adam Arany et al.
Modeling real-world multidimensional time series can be particularly challenging when these are sporadically observed (i.e., sampling is irregular both in time and across dimensions)-such as in the case of clinical patient data. To address these challenges, we propose (1) a continuous-time version of the Gated Recurrent Unit, building upon the recent Neural Ordinary Differential Equations (Chen et al., 2018), and (2) a Bayesian update network that processes the sporadic observations. We bring these two ideas together in our GRU-ODE-Bayes method. We then demonstrate that the proposed method encodes a continuity prior for the latent process and that it can exactly represent the Fokker-Planck dynamics of complex processes driven by a multidimensional stochastic differential equation. Additionally, empirical evaluation shows that our method outperforms the state of the art on both synthetic data and real-world data with applications in healthcare and climate forecast. What is more, the continuity prior is shown to be well suited for low number of samples settings.
LGNov 26, 2018
Deep Ensemble Tensor Factorization for Longitudinal Patient Trajectories ClassificationEdward De Brouwer, Jaak Simm, Adam Arany et al.
We present a generative approach to classify scarcely observed longitudinal patient trajectories. The available time series are represented as tensors and factorized using generative deep recurrent neural networks. The learned factors represent the patient data in a compact way and can then be used in a downstream classification task. For more robustness and accuracy in the predictions, we used an ensemble of those deep generative models to mimic Bayesian posterior sampling. We illustrate the performance of our architecture on an intensive-care case study of in-hospital mortality prediction with 96 longitudinal measurement types measured across the first 48-hour from admission. Our combination of generative and ensemble strategies achieves an AUC of over 0.85, and outperforms the SAPS-II mortality score and GRU baselines.
AISep 14, 2017
Fast semi-supervised discriminant analysis for binary classification of large data-setsJoris Tavernier, Jaak Simm, Karl Meerbergen et al.
High-dimensional data requires scalable algorithms. We propose and analyze three scalable and related algorithms for semi-supervised discriminant analysis (SDA). These methods are based on Krylov subspace methods which exploit the data sparsity and the shift-invariance of Krylov subspaces. In addition, the problem definition was improved by adding centralization to the semi-supervised setting. The proposed methods are evaluated on a industry-scale data set from a pharmaceutical company to predict compound activity on target proteins. The results show that SDA achieves good predictive performance and our methods only require a few seconds, significantly improving computation time on previous state of the art.
MLDec 1, 2015
Highly Scalable Tensor Factorization for Prediction of Drug-Protein Interaction TypeAdam Arany, Jaak Simm, Pooya Zakeri et al.
The understanding of the type of inhibitory interaction plays an important role in drug design. Therefore, researchers are interested to know whether a drug has competitive or non-competitive interaction to particular protein targets. Method: to analyze the interaction types we propose factorization method Macau which allows us to combine different measurement types into a single tensor together with proteins and compounds. The compounds are characterized by high dimensional 2D ECFP fingerprints. The novelty of the proposed method is that using a specially designed noise injection MCMC sampler it can incorporate high dimensional side information, i.e., millions of unique 2D ECFP compound features, even for large scale datasets of millions of compounds. Without the side information, in this case, the tensor factorization would be practically futile. Results: using public IC50 and Ki data from ChEMBL we trained a model from where we can identify the latent subspace separating the two measurement types (IC50 and Ki). The results suggest the proposed method can detect the competitive inhibitory activity between compounds and proteins.
MLSep 15, 2015
Macau: Scalable Bayesian Multi-relational Factorization with Side Information using MCMCJaak Simm, Adam Arany, Pooya Zakeri et al.
We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundred millions of observations, and sparse entity features with millions of dimensions. To achieve the scale up, we specially designed sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show performance and advanced features of Macau in a set of experiments, including challenging drug-protein activity prediction task.
LGDec 2, 2014
Easy Hyperparameter Search Using OptunityMarc Claesen, Jaak Simm, Dusan Popovic et al.
Optunity is a free software package dedicated to hyperparameter optimization. It contains various types of solvers, ranging from undirected methods to direct search, particle swarm and evolutionary optimization. The design focuses on ease of use, flexibility, code clarity and interoperability with existing software in all machine learning environments. Optunity is written in Python and contains interfaces to environments such as R and MATLAB. Optunity uses a BSD license and is freely available online at http://www.optunity.net.