CLNov 9, 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language ModelBigScience Workshop, Teven Le Scao, Angela Fan et al. · allen-ai, berkeley
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
LGJul 8, 2022Code
Accelerating Material Design with the Generative Toolkit for Scientific DiscoveryMatteo Manica, Jannis Born, Joris Cadow et al. · mit
With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery. They harness powerful representations learned from datasets to speed up the formulation of novel hypotheses with the potential to impact material discovery broadly. We present the Generative Toolkit for Scientific Discovery (GT4SD). This extensible open-source library enables scientists, developers, and researchers to train and use state-of-the-art generative models to accelerate scientific discovery focused on material design.
LGJan 29, 2023
Unifying Molecular and Textual Representations via Multi-task Language ModellingDimitrios Christofidellis, Giorgio Giannone, Jannis Born et al. · mit
The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increase with scale, as measured by more than a dozen of relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
CLAug 15, 2022Code
Z-BERT-A: a zero-shot Pipeline for Unknown Intent detectionDaniele Comi, Dimitrios Christofidellis, Pier Francesco Piazza et al. · ibm-research
Intent discovery is a crucial task in natural language processing, and it is increasingly relevant for various of industrial applications. Identifying novel, unseen intents from user inputs remains one of the biggest challenges in this field. Herein, we propose Zero-Shot-BERT-Adapters, a two-stage method for multilingual intent discovery relying on a Transformer architecture, fine-tuned with Adapters. We train the model for Natural Language Inference (NLI) and later perform unknown intent classification in a zero-shot setting for multiple languages. In our evaluation, we first analyze the quality of the model after adaptive fine-tuning on known classes. Secondly, we evaluate its performance in casting intent classification as an NLI task. Lastly, we test the zero-shot performance of the model on unseen classes, showing how Zero-Shot-BERT-Adapters can effectively perform intent discovery by generating semantically similar intents, if not equal, to the ground-truth ones. Our experiments show how Zero-Shot-BERT-Adapters outperforms various baselines in two zero-shot settings: known intent classification and unseen intent discovery. The proposed pipeline holds the potential for broad application in customer care. It enables automated dynamic triage using a lightweight model that can be easily deployed and scaled in various business scenarios, unlike large language models. Zero-Shot-BERT-Adapters represents an innovative multi-language approach for intent discovery, enabling the online generation of novel intents. A Python package implementing the pipeline and the new datasets we compiled are available at the following link: https://github.com/GT4SD/zero-shot-bert-adapters.
LGJan 20, 2023Code
Domain-agnostic and Multi-level Evaluation of Generative ModelsGirmaw Abebe Tadesse, Jannis Born, Celia Cintas et al.
While the capabilities of generative models heavily improved in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.
CLFeb 20, 2025Code
Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence ModelingEric Egli, Matteo Manica, Jannis Born
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of $5$M bytes on single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q\&A tasks and find that, despite serializing images and the absence of an encoder, a MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm
LGFeb 1, 2022Code
Regression Transformer: Concurrent sequence regression and generation for molecular language modelingJannis Born, Matteo Manica
Despite significant progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a novel method that abstracts regression as a conditional sequence modeling problem. This introduces a new paradigm of multitask language models which seamlessly bridge sequence regression and conditional sequence generation. We thoroughly demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction tasks of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by a novel, alternating training scheme that enables the model to decorate seed sequences by desired properties, e.g., to optimize reaction yield. In sum, the RT is the first report of a multitask model that concurrently excels at predictive and generative tasks in biochemistry. This finds particular application in property-driven, local exploration of the chemical or protein space and could pave the road toward foundation models in material design. The code to reproduce all experiments of the paper is available at: https://github.com/IBM/regression-transformer
LGOct 15, 2021Code
Multitask Prompted Training Enables Zero-Shot Task GeneralizationVictor Sanh, Albert Webson, Colin Raffel et al.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.
BMDec 5, 2020Code
Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream TasksModestas Filipavicius, Matteo Manica, Joris Cadow et al.
Less than 1% of protein sequences are structurally and functionally annotated. Natural Language Processing (NLP) community has recently embraced self-supervised learning as a powerful approach to learn representations from unlabeled text, in large part due to the attention-based context-aware Transformer models. In this work we present a modification to the RoBERTa model by inputting during pre-training a mixture of binding and non-binding protein sequences (from STRING database). However, the sequence pairs have no label to indicate their binding status, as the model relies solely on Masked Language Modeling (MLM) objective during pre-training. After fine-tuning, such approach surpasses models trained on single protein sequences for protein-protein binding prediction, TCR-epitope binding prediction, cellular-localization and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Furthermore, we compress protein sequences by 64% with the Byte Pair Encoding (BPE) vocabulary consisting of 10K subwords, each around 3-4 amino acids long. Finally, to expand the model input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support 2,048 tokens. Further work in token-level classification for secondary structure prediction is needed. Code available at: https://github.com/PaccMann/paccmann_proteomics
DCNov 20, 2024
Transforming the Hybrid Cloud for Emerging AI WorkloadsDeming Chen, Alaa Youssef, Ruchi Pendse et al.
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.
AIApr 3, 2025
We Need Improved Data Curation and Attribution in AI for Scientific DiscoveryMara Graziani, Antonio Foncubierta, Dimitrios Christofidellis et al.
As the interplay between human-generated and synthetic data evolves, new challenges arise in scientific discovery concerning the integrity of the data and the stability of the models. In this work, we examine the role of synthetic data as opposed to that of real experimental data for scientific research. Our analyses indicate that nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates, opening new opportunities to enhance their discoverability and usability by automated methods. Additionally, we observe an increasing difficulty in distinguishing synthetic from real experimental data. We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data, thereby strengthening data traceability and integrity. Our estimates suggest that watermarking even less than half of the real world data generated annually could help sustain model robustness, while promoting a balanced integration of synthetic and human-generated content.
CLDec 18, 2020
Understood in Translation, Transformers for Domain UnderstandingDimitrios Christofidellis, Matteo Manica, Leonidas Georgopoulos et al.
Knowledge acquisition is the essential first step of any Knowledge Graph (KG) application. This knowledge can be extracted from a given corpus (KG generation process) or specified from an existing KG (KG specification process). Focusing on domain specific solutions, knowledge acquisition is a labor intensive task usually orchestrated and supervised by subject matter experts. Specifically, the domain of interest is usually manually defined and then the needed generation or extraction tools are utilized to produce the KG. Herein, we propose a supervised machine learning method, based on Transformers, for domain definition of a corpus. We argue why such automated definition of the domain's structure is beneficial both in terms of construction time and quality of the generated graph. The proposed method is extensively validated on three public datasets (WebNLG, NYT and DocRED) by comparing it with two reference methods based on CNNs and RNNs models. The evaluation shows the efficiency of our model in this task. Focusing on scientific document understanding, we present a new health domain dataset based on publications extracted from PubMed and we successfully utilize our method on this. Lastly, we demonstrate how this work lays the foundation for fully automated and unsupervised KG generation.
CLSep 23, 2020
Hierarchical Pre-training for Sequence Labelling in Spoken DialogEmile Chapuis, Pierre Colombo, Matteo Manica et al.
Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (\texttt{SILICONE}). \texttt{SILICONE} is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over $2.3$ billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.
QMMay 27, 2020
PaccMann$^{RL}$ on SARS-CoV-2: Designing antiviral candidates with conditional generative modelsJannis Born, Matteo Manica, Joris Cadow et al.
With the fast development of COVID-19 into a global pandemic, scientists around the globe are desperately searching for effective antiviral therapeutic agents. Bridging systems biology and drug discovery, we propose a deep learning framework for conditional de novo design of antiviral candidate drugs tailored against given protein targets. First, we train a multimodal ligand--protein binding affinity model on predicting affinities of antiviral compounds to target proteins and couple this model with pharmacological toxicity predictors. Exploiting this multi-objective as a reward function of a conditional molecular generator (consisting of two VAEs), we showcase a framework that navigates the chemical space toward regions with more antiviral molecules. Specifically, we explore a challenging setting of generating ligands against unseen protein targets by performing a leave-one-out-cross-validation on 41 SARS-CoV-2-related target proteins. Using deep RL, it is demonstrated that in 35 out of 41 cases, the generation is biased towards sampling more binding ligands, with an average increase of 83% comparing to an unbiased VAE. We present a case-study on a potential Envelope-protein inhibitor and perform a synthetic accessibility assessment of the best generated molecules is performed that resembles a viable roadmap towards a rapid in-vitro evaluation of potential SARS-CoV-2 inhibitors.
LGApr 2, 2020
CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative ModelsVijil Chenthamarakshan, Payel Das, Samuel C. Hoffman et al.
The novel nature of SARS-CoV-2 calls for the development of efficient de novo drug design approaches. In this study, we propose an end-to-end framework, named CogMol (Controlled Generation of Molecules), for designing new drug-like small molecules targeting novel viral proteins with high affinity and off-target selectivity. CogMol combines adaptive pre-training of a molecular SMILES Variational Autoencoder (VAE) and an efficient multi-attribute controlled sampling scheme that uses guidance from attribute predictors trained on latent features. To generate novel and optimal drug-like molecules for unseen viral targets, CogMol leverages a protein-molecule binding affinity predictor that is trained using SMILES VAE embeddings and protein sequence embeddings learned unsupervised from a large corpus. CogMol framework is applied to three SARS-CoV-2 target proteins: main protease, receptor-binding domain of the spike protein, and non-structural protein 9 replicase. The generated candidates are novel at both molecular and chemical scaffold levels when compared to the training data. CogMol also includes insilico screening for assessing toxicity of parent molecules and their metabolites with a multi-task toxicity classifier, synthetic feasibility with a chemical retrosynthesis predictor, and target structure binding with docking simulations. Docking reveals favorable binding of generated molecules to the target protein structure, where 87-95 % of high affinity molecules showed docking free energy < -6 kcal/mol. When compared to approved drugs, the majority of designed compounds show low parent molecule and metabolite toxicity and high synthetic feasibility. In summary, CogMol handles multi-constraint design of synthesizable, low-toxic, drug-like molecules with high target specificity and selectivity, and does not need target-dependent fine-tuning of the framework or target structure information.
CLFeb 21, 2020
Guider l'attention dans les modeles de sequence a sequence pour la prediction des actes de dialoguePierre Colombo, Emile Chapuis, Matteo Manica et al.
The task of predicting dialog acts (DA) based on conversational dialog is a key component in the development of conversational agents. Accurately predicting DAs requires a precise modeling of both the conversation and the global tag dependencies. We leverage seq2seq approaches widely adopted in Neural Machine Translation (NMT) to improve the modelling of tag sequentiality. Seq2seq models are known to learn complex global dependencies while currently proposed approaches using linear conditional random fields (CRF) only model local tag dependencies. In this work, we introduce a seq2seq model tailored for DA classification using: a hierarchical encoder, a novel guided attention mechanism and beam search applied to both training and inference. Compared to the state of the art our model does not require handcrafted features and is trained end-to-end. Furthermore, the proposed approach achieves an unmatched accuracy score of 85% on SwDA, and state-of-the-art accuracy score of 91.6% on MRDA.
CLFeb 20, 2020
Guiding attention in Sequence-to-sequence models for Dialogue Act predictionPierre Colombo, Emile Chapuis, Matteo Manica et al.
The task of predicting dialog acts (DA) based on conversational dialog is a key component in the development of conversational agents. Accurately predicting DAs requires a precise modeling of both the conversation and the global tag dependencies. We leverage seq2seq approaches widely adopted in Neural Machine Translation (NMT) to improve the modelling of tag sequentiality. Seq2seq models are known to learn complex global dependencies while currently proposed approaches using linear conditional random fields (CRF) only model local tag dependencies. In this work, we introduce a seq2seq model tailored for DA classification using: a hierarchical encoder, a novel guided attention mechanism and beam search applied to both training and inference. Compared to the state of the art our model does not require handcrafted features and is trained end-to-end. Furthermore, the proposed approach achieves an unmatched accuracy score of 85% on SwDA, and state-of-the-art accuracy score of 91.6% on MRDA.
BMAug 29, 2019
PaccMann$^{RL}$: Designing anticancer drugs from transcriptomic data via reinforcement learningJannis Born, Matteo Manica, Ali Oskooei et al.
With the advent of deep generative models in computational chemistry, in silico anticancer drug design has undergone an unprecedented transformation. While state-of-the-art deep learning approaches have shown potential in generating compounds with desired chemical properties, they disregard the genetic profile and properties of the target disease. Here, we introduce the first generative model capable of tailoring anticancer compounds for a specific biomolecular profile. Using a RL framework, the transcriptomic profiles of cancer cells are used as a context for the generation of candidate molecules. Our molecule generator combines two separately pretrained variational autoencoders (VAEs) - the first VAE encodes transcriptomic profiles into a smooth, latent space which in turn is used to condition a second VAE to generate novel molecular structures on the given transcriptomic profile. The generative process is optimized through PaccMann, a previously developed drug sensitivity prediction model to obtain effective anticancer compounds for the given context (i.e., transcriptomic profile). We demonstrate how the molecule generation can be biased towards compounds with high predicted inhibitory effect against individual cell lines or specific cancer sites. We verify our approach by investigating candidate drugs generated against specific cancer types and find the highest structural similarity to existing compounds with known efficacy against these cancer types. We envision our approach to transform in silico anticancer drug design by leveraging the biomolecular characteristics of the disease in order to increase success rates in lead compound discovery.
IRJul 19, 2019
An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical DiscoveriesMatteo Manica, Christoph Auer, Valery Weber et al.
Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.
LGApr 25, 2019
Towards Explainable Anticancer Compound Sensitivity Prediction via Multimodal Attention-based Convolutional EncodersMatteo Manica, Ali Oskooei, Jannis Born et al.
In line with recent advances in neural drug design and sensitivity prediction, we propose a novel architecture for interpretable prediction of anticancer compound sensitivity using a multimodal attention-based convolutional encoder. Our model is based on the three key pillars of drug sensitivity: compounds' structure in the form of a SMILES sequence, gene expression profiles of tumors and prior knowledge on intracellular interactions from protein-protein interaction networks. We demonstrate that our multiscale convolutional attention-based (MCA) encoder significantly outperforms a baseline model trained on Morgan fingerprints, a selection of encoders based on SMILES as well as previously reported state of the art for multimodal drug sensitivity prediction (R2 = 0.86 and RMSE = 0.89). Moreover, the explainability of our approach is demonstrated by a thorough analysis of the attention weights. We show that the attended genes significantly enrich apoptotic processes and that the drug attention is strongly correlated with a standard chemical structure similarity index. Finally, we report a case study of two receptor tyrosine kinase (RTK) inhibitors acting on a leukemia cell line, showcasing the ability of the model to focus on informative genes and submolecular regions of the two compounds. The demonstrated generalizability and the interpretability of our model testify its potential for in-silico prediction of anticancer compound efficacy on unseen cancer cells, positioning it as a valid solution for the development of personalized therapies as well as for the evaluation of candidate compounds in de novo drug design.
LGJan 17, 2019
NeuNetS: An Automated Synthesis Engine for Neural Network DesignAtin Sood, Benjamin Elder, Benjamin Herta et al.
Application of neural networks to a vast variety of practical applications is transforming the way AI is applied in practice. Pre-trained neural network models available through APIs or capability to custom train pre-built neural network architectures with customer data has made the consumption of AI by developers much simpler and resulted in broad adoption of these complex AI models. While prebuilt network models exist for certain scenarios, to try and meet the constraints that are unique to each application, AI teams need to think about developing custom neural network architectures that can meet the tradeoff between accuracy and memory footprint to achieve the tight constraints of their unique use-cases. However, only a small proportion of data science teams have the skills and experience needed to create a neural network from scratch, and the demand far exceeds the supply. In this paper, we present NeuNetS : An automated Neural Network Synthesis engine for custom neural network design that is available as part of IBM's AI OpenScale's product. NeuNetS is available for both Text and Image domains and can build neural networks for specific tasks in a fraction of the time it takes today with human effort, and with accuracy similar to that of human-designed AI models.
LGNov 16, 2018
PaccMann: Prediction of anticancer compound sensitivity with multi-modal attention-based neural networksAli Oskooei, Jannis Born, Matteo Manica et al.
We present a novel approach for the prediction of anticancer compound sensitivity by means of multi-modal attention-based neural networks (PaccMann). In our approach, we integrate three key pillars of drug sensitivity, namely, the molecular structure of compounds, transcriptomic profiles of cancer cells as well as prior knowledge about interactions among proteins within cells. Our models ingest a drug-cell pair consisting of SMILES encoding of a compound and the gene expression profile of a cancer cell and predicts an IC50 sensitivity value. Gene expression profiles are encoded using an attention-based encoding mechanism that assigns high weights to the most informative genes. We present and study three encoders for SMILES string of compounds: 1) bidirectional recurrent 2) convolutional 3) attention-based encoders. We compare our devised models against a baseline model that ingests engineered fingerprints to represent the molecular structure. We demonstrate that using our attention-based encoders, we can surpass the baseline model. The use of attention-based encoders enhance interpretability and enable us to identify genes, bonds and atoms that were used by the network to make a prediction.
QMAug 18, 2018
Network-based Biased Tree Ensembles (NetBiTE) for Drug Sensitivity Prediction and Drug Sensitivity Biomarker Identification in CancerAli Oskooei, Matteo Manica, Roland Mathis et al.
We present the Network-based Biased Tree Ensembles (NetBiTE) method for drug sensitivity prediction and drug sensitivity biomarker identification in cancer using a combination of prior knowledge and gene expression data. Our devised method consists of a biased tree ensemble that is built according to a probabilistic bias weight distribution. The bias weight distribution is obtained from the assignment of high weights to the drug targets and propagating the assigned weights over a protein-protein interaction network such as STRING. The propagation of weights, defines neighborhoods of influence around the drug targets and as such simulates the spread of perturbations within the cell, following drug administration. Using a synthetic dataset, we showcase how application of biased tree ensembles (BiTE) results in significant accuracy gains at a much lower computational cost compared to the unbiased random forests (RF) algorithm. We then apply NetBiTE to the Genomics of Drug Sensitivity in Cancer (GDSC) dataset and demonstrate that NetBiTE outperforms RF in predicting IC50 drug sensitivity, only for drugs that target membrane receptor pathways (MRPs): RTK, EGFR and IGFR signaling pathways. We propose based on the NetBiTE results, that for drugs that inhibit MRPs, the expression of target genes prior to drug administration is a biomarker for IC50 drug sensitivity following drug administration. We further verify and reinforce this proposition through control studies on, PI3K/MTOR signaling pathway inhibitors, a drug category that does not target MRPs, and through assignment of dummy targets to MRP inhibiting drugs and investigating the variation in NetBiTE accuracy.
MNMar 29, 2018
PIMKL: Pathway Induced Multiple Kernel LearningMatteo Manica, Joris Cadow, Roland Mathis et al.
Reliable identification of molecular biomarkers is essential for accurate patient stratification. While state-of-the-art machine learning approaches for sample classification continue to push boundaries in terms of performance, most of these methods are not able to integrate different data types and lack generalization power, limiting their application in a clinical setting. Furthermore, many methods behave as black boxes, and we have very little understanding about the mechanisms that lead to the prediction. While opaqueness concerning machine behaviour might not be a problem in deterministic domains, in health care, providing explanations about the molecular factors and phenotypes that are driving the classification is crucial to build trust in the performance of the predictive system. We propose Pathway Induced Multiple Kernel Learning (PIMKL), a novel methodology to reliably classify samples that can also help gain insights into the molecular mechanisms that underlie the classification. PIMKL exploits prior knowledge in the form of a molecular interaction network and annotated gene sets, by optimizing a mixture of pathway-induced kernels using a Multiple Kernel Learning (MKL) algorithm, an approach that has demonstrated excellent performance in different machine learning applications. After optimizing the combination of kernels for prediction of a specific phenotype, the model provides a stable molecular signature that can be interpreted in the light of the ingested prior knowledge and that can be used in transfer learning tasks.