Daniele Ramazzotti

LG
17papers
173citations
Novelty36%
AI Score23

17 Papers

LGJun 7, 2017Code
Learning the structure of Bayesian Networks via the bootstrap

Giulio Caravagna, Daniele Ramazzotti

Learning the structure of dependencies among multiple random variables is a problem of considerable theoretical and practical interest. Within the context of Bayesian Networks, a practical and surprisingly successful solution to this learning problem is achieved by adopting score-functions optimisation schema, augmented with multiple restarts to avoid local optima. Yet, the conditions under which such strategies work well are poorly understood, and there are also some intrinsic limitations to learning the directionality of the interaction among the variables. Following an early intuition of Friedman and Koller, we propose to decouple the learning problem into two steps: first, we identify a partial ordering among input variables which constrains the structural learning problem, and then propose an effective bootstrap-based algorithm to simulate augmented data sets, and select the most important dependencies among the variables. By using several synthetic data sets, we show that our algorithm yields better recovery performance than the state of the art, increasing the chances of identifying a globally-optimal solution to the learning problem, and solving also well-known identifiability issues that affect the standard approach. We use our new algorithm to infer statistical dependencies between cancer driver somatic mutations detected by high-throughput genome sequencing data of multiple colorectal cancer patients. In this way, we also show how the proposed methods can shade new insights about cancer initiation, and progression. Code: https://github.com/caravagn/Bootstrap-based-Learning

GNMay 8, 2017Code
cyTRON and cyTRON/JS: two Cytoscape-based applications for the inference of cancer evolution models

Lucrezia Patruno, Edoardo Galimberti, Daniele Ramazzotti et al.

The increasing availability of sequencing data of cancer samples is fueling the development of algorithmic strategies to investigate tumor heterogeneity and infer reliable models of cancer evolution. We here build up on previous works on cancer progression inference from genomic alteration data, to deliver two distinct Cytoscape-based applications, which allow to produce, visualize and manipulate cancer evolution models, also by interacting with public genomic and proteomics databases. In particular, we here introduce cyTRON, a stand-alone Cytoscape app, and cyTRON/JS, a web application which employs the functionalities of Cytoscape/JS. cyTRON was developed in Java; the code is available at https://github.com/BIMIB-DISCo/cyTRON and on the Cytoscape App Store http://apps.cytoscape.org/apps/cytron. cyTRON/JS was developed in JavaScript and R; the source code of the tool is available at https://github.com/BIMIB-DISCo/cyTRON-js and the tool is accessible from https://bimib.disco.unimib.it/cytronjs/welcome.

GNMar 21, 2017Code
SIMLR: A Tool for Large-Scale Genomic Analyses by Multi-Kernel Learning

Bo Wang, Daniele Ramazzotti, Luca De Sano et al.

We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization. Availability and Implementation SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.

MLAug 26, 2014Code
PMCE: efficient inference of expressive models of cancer evolution with high prognostic power

Fabrizio Angaroni, Kevin Chen, Chiara Damiani et al.

Motivation: Driver (epi)genomic alterations underlie the positive selection of cancer subpopulations, which promotes drug resistance and relapse. Even though substantial heterogeneity is witnessed in most cancer types, mutation accumulation patterns can be regularly found and can be exploited to reconstruct predictive models of cancer evolution. Yet, available methods cannot infer logical formulas connecting events to represent alternative evolutionary routes or convergent evolution. Results: We introduce PMCE, an expressive framework that leverages mutational profiles from cross-sectional sequencing data to infer probabilistic graphical models of cancer evolution including arbitrary logical formulas, and which outperforms the state-of-the-art in terms of accuracy and robustness to noise, on simulations. The application of PMCE to 7866 samples from the TCGA database allows us to identify a highly significant correlation between the predicted evolutionary paths and the overall survival in 7 tumor types, proving that our approach can effectively stratify cancer patients in reliable risk groups. Availability: PMCE is freely available at https://github.com/BIMIB-DISCo/PMCE, in addition to the code to replicate all the analyses presented in the manuscript. Contacts: daniele.ramazzotti@unimib.it, alex.graudenzi@ibfm.cnr.it.

QMAug 6, 2018
Improved survival of cancer patients admitted to the ICU between 2002 and 2011 at a U.S. teaching hospital

Chris Sauer, Jinghui Dong, Leo Celi et al.

Over the past decades, both critical care and cancer care have improved substantially. Due to increased cancer-specific survival, we hypothesized that both the number of cancer patients admitted to the ICU and overall survival have increased since the millennium change. MIMIC-III, a freely accessible critical care database of Beth Israel Deaconess Medical Center, Boston, USA was used to retrospectively study trends and outcomes of cancer patients admitted to the ICU between 2002 and 2011. Multiple logistic regression analysis was performed to adjust for confounders of 28-day and 1-year mortality. Out of 41,468 unique ICU admissions, 1,100 hemato-oncologic, 3,953 oncologic and 49 patients with both a hematological and solid malignancy were analyzed. Hematological patients had higher critical illness scores than non-cancer patients, while oncologic patients had similar APACHE-III and SOFA-scores compared to non-cancer patients. In the univariate analysis, cancer was strongly associated with mortality (OR= 2.74, 95%CI: 2.56, 2.94). Over the 10-year study period, 28-day mortality of cancer patients decreased by 30%. This trend persisted after adjustment for covariates, with cancer patients having significantly higher mortality (OR=2.63, 95%CI: 2.38, 2.88). Between 2002 and 2011, both the adjusted odds of 28-day mortality and the adjusted odds of 1-year mortality for cancer patients decreased by 6% (95%CI: 4%, 9%). Having cancer was the strongest single predictor of 1-year mortality in the multivariate model (OR=4.47, 95%CI: 4.11, 4.84).

SIAug 6, 2018
Probabilistic Causal Analysis of Social Influence

Francesco Bonchi, Francesco Gullo, Bud Mishra et al.

Mastering the dynamics of social influence requires separating, in a database of information propagation traces, the genuine causal processes from temporal correlation, i.e., homophily and other spurious causes. However, most studies to characterize social influence, and, in general, most data-science analyses focus on correlations, statistical independence, or conditional independence. Only recently, there has been a resurgence of interest in "causal data science", e.g., grounded on causality theories. In this paper we adopt a principled causal approach to the analysis of social influence from information-propagation data, rooted in the theory of probabilistic causation. Our approach consists of two phases. In the first one, in order to avoid the pitfalls of misinterpreting causation when the data spans a mixture of several subtypes ("Simpson's paradox"), we partition the set of propagation traces into groups, in such a way that each group is as less contradictory as possible in terms of the hierarchical structure of information propagation. To achieve this goal, we borrow the notion of "agony" and define the Agony-bounded Partitioning problem, which we prove being hard, and for which we develop two efficient algorithms with approximation guarantees. In the second phase, for each group from the first phase, we apply a constrained MLE approach to ultimately learn a minimal causal topology. Experiments on synthetic data show that our method is able to retrieve the genuine causal arcs w.r.t. a ground-truth generative model. Experiments on real data show that, by focusing only on the extracted causal structures instead of the whole social graph, the effectiveness of predicting influence spread is significantly improved.

CYAug 4, 2018
Withholding or withdrawing invasive interventions may not accelerate time to death among dying ICU patients

Daniele Ramazzotti, Peter Clardy, Leo Anthony Celi et al.

We considered observational data available from the MIMIC-III open-access ICU database and collected within a study period between year 2002 up to 2011. If a patient had multiple admissions to the ICU during the 30 days before death, only the first stay was analyzed, leading to a final set of 6,436 unique ICU admissions during the study period. We tested two hypotheses: (i) administration of invasive intervention during the ICU stay immediately preceding end-of-life would decrease over the study time period and (ii) time-to-death from ICU admission would also decrease, due to the decrease in invasive intervention administration. To investigate the latter hypothesis, we performed a subgroups analysis by considering patients with lowest and highest severity. To do so, we stratified the patients based on their SAPS I scores, and we considered patients within the first and the third tertiles of the score. We then assessed differences in trends within these groups between years 2002-05 vs. 2008-11. Comparing the period 2002-2005 vs. 2008-2011, we found a reduction in endotracheal ventilation among patients who died within 30 days of ICU admission (120.8 vs. 68.5 hours for the lowest severity patients, p<0.001; 47.7 vs. 46.0 hours for the highest severity patients, p=0.004). This is explained in part by an increase in the use of non-invasive ventilation. Comparing the period 2002-2005 vs. 2008-2011, we found a reduction in the use of vasopressors and inotropes among patients with the lowest severity who died within 30 days of ICU admission (41.8 vs. 36.2 hours, p<0.001) but not among those with the highest severity. Despite a reduction in the use of invasive interventions, we did not find a reduction in the time to death between 2002-2005 vs. 2008-2011 (7.8 days vs. 8.2 days for the lowest severity patients, p=0.32; 2.1 days vs. 2.0 days for the highest severity patients, p=0.74).

LGAug 3, 2018
Investigating the performance of multi-objective optimization when learning Bayesian Networks

Paolo Cazzaniga, Marco S. Nobile, Daniele Ramazzotti

Bayesian Networks have been widely used in the last decades in many fields, to describe statistical dependencies among random variables. In general, learning the structure of such models is a problem with considerable theoretical interest that poses many challenges. On the one hand, it is a well-known NP-complete problem, practically hardened by the huge search space of possible solutions. On the other hand, the phenomenon of I-equivalence, i.e., different graphical structures underpinning the same set of statistical dependencies, may lead to multimodal fitness landscapes further hindering maximum likelihood approaches to solve the task. In particular, we exploit the NSGA-II multi-objective optimization procedure in order to explicitly account for both the likelihood of a solution and the number of selected arcs, by setting these as the two objective functions of the method. The aim of this work is to investigate the behavior of NSGA-II and analyse the quality of its solutions. We thus thoroughly examined the optimization results obtained on a wide set of simulated data, by considering both the goodness of the inferred solutions in terms of the objective functions values achieved, and by comparing the retrieved structures with the ground truth, i.e., the networks used to generate the target data. Our results show that NSGA-II can converge to solutions characterized by better likelihood and less arcs than classic approaches, although paradoxically characterized in many cases by a lower similarity with the target network.

GNSep 4, 2017
Learning mutational graphs of individual tumour evolution from single-cell and multi-region sequencing data

Daniele Ramazzotti, Alex Graudenzi, Luca De Sano et al.

Background. A large number of algorithms is being developed to reconstruct evolutionary models of individual tumours from genome sequencing data. Most methods can analyze multiple samples collected either through bulk multi-region sequencing experiments or the sequencing of individual cancer cells. However, rarely the same method can support both data types. Results. We introduce TRaIT, a computational framework to infer mutational graphs that model the accumulation of multiple types of somatic alterations driving tumour evolution. Compared to other tools, TRaIT supports multi-region and single-cell sequencing data within the same statistical framework, and delivers expressive models that capture many complex evolutionary phenomena. TRaIT improves accuracy, robustness to data-specific errors and computational complexity compared to competing methods. Conclusions. We show that the application of TRaIT to single-cell and multi-region cancer datasets can produce accurate and reliable models of single-tumour evolution, quantify the extent of intra-tumour heterogeneity and generate new testable experimental hypotheses.

LGApr 27, 2017
Learning the structure of Bayesian Networks: A quantitative assessment of the effect of different algorithmic schemes

Stefano Beretta, Mauro Castelli, Ivo Goncalves et al.

One of the most challenging tasks when adopting Bayesian Networks (BNs) is the one of learning their structure from data. This task is complicated by the huge search space of possible solutions, and by the fact that the problem is NP-hard. Hence, full enumeration of all the possible solutions is not always feasible and approximations are often required. However, to the best of our knowledge, a quantitative analysis of the performance and characteristics of the different heuristics to solve this problem has never been done before. For this reason, in this work, we provide a detailed comparison of many different state-of-the-arts methods for structural learning on simulated data considering both BNs with discrete and continuous variables, and with different rates of noise in the data. In particular, we investigate the performance of different widespread scores and algorithmic approaches proposed for the inference and the statistical pitfalls within them.

LGMar 8, 2017
Causal Data Science for Financial Stress Testing

Gelin Gao, Bud Mishra, Daniele Ramazzotti

The most recent financial upheavals have cast doubt on the adequacy of some of the conventional quantitative risk management strategies, such as VaR (Value at Risk), in many common situations. Consequently, there has been an increasing need for verisimilar financial stress testings, namely simulating and analyzing financial portfolios in extreme, albeit rare scenarios. Unlike conventional risk management which exploits statistical correlations among financial instruments, here we focus our analysis on the notion of probabilistic causation, which is embodied by Suppes-Bayes Causal Networks (SBCNs); SBCNs are probabilistic graphical models that have many attractive features in terms of more accurate causal analysis for generating financial stress scenarios. In this paper, we present a novel approach for conducting stress testing of financial portfolios based on SBCNs in combination with classical machine learning classification tools. The resulting method is shown to be capable of correctly discovering the causal relationships among financial factors that affect the portfolios and thus, simulating stress testing scenarios with a higher accuracy and lower computational complexity than conventional Monte Carlo Simulations.

LGMar 8, 2017
Efficient computational strategies to learn the structure of probabilistic graphical models of cumulative phenomena

Daniele Ramazzotti, Marco S. Nobile, Marco Antoniotti et al.

Structural learning of Bayesian Networks (BNs) is a NP-hard problem, which is further complicated by many theoretical issues, such as the I-equivalence among different structures. In this work, we focus on a specific subclass of BNs, named Suppes-Bayes Causal Networks (SBCNs), which include specific structural constraints based on Suppes' probabilistic causation to efficiently model cumulative phenomena. Here we compare the performance, via extensive simulations, of various state-of-the-art search strategies, such as local search techniques and Genetic Algorithms, as well as of distinct regularization methods. The assessment is performed on a large number of simulated datasets from topologies with distinct levels of complexity, various sample size and different rates of errors in the data. Among the main results, we show that the introduction of Suppes' constraints dramatically improve the inference accuracy, by reducing the solution space and providing a temporal ordering on the variables. We also report on trade-offs among different search techniques that can be efficiently employed in distinct experimental settings. This manuscript is an extended version of the paper "Structural Learning of Probabilistic Graphical Models of Cumulative Phenomena" presented at the 2018 International Conference on Computational Science.

LGMar 8, 2017
Combining Bayesian Approaches and Evolutionary Techniques for the Inference of Breast Cancer Networks

Stefano Beretta, Mauro Castelli, Ivo Goncalves et al.

Gene and protein networks are very important to model complex large-scale systems in molecular biology. Inferring or reverseengineering such networks can be defined as the process of identifying gene/protein interactions from experimental data through computational analysis. However, this task is typically complicated by the enormously large scale of the unknowns in a rather small sample size. Furthermore, when the goal is to study causal relationships within the network, tools capable of overcoming the limitations of correlation networks are required. In this work, we make use of Bayesian Graphical Models to attach this problem and, specifically, we perform a comparative study of different state-of-the-art heuristics, analyzing their performance in inferring the structure of the Bayesian Network from breast cancer data.

LGMar 8, 2017
Parallel Implementation of Efficient Search Schemes for the Inference of Cancer Progression Models

Daniele Ramazzotti, Marco S. Nobile, Paolo Cazzaniga et al.

The emergence and development of cancer is a consequence of the accumulation over time of genomic mutations involving a specific set of genes, which provides the cancer clones with a functional selective advantage. In this work, we model the order of accumulation of such mutations during the progression, which eventually leads to the disease, by means of probabilistic graphic models, i.e., Bayesian Networks (BNs). We investigate how to perform the task of learning the structure of such BNs, according to experimental evidence, adopting a global optimization meta-heuristics. In particular, in this work we rely on Genetic Algorithms, and to strongly reduce the execution time of the inference -- which can also involve multiple repetitions to collect statistically significant assessments of the data -- we distribute the calculations using both multi-threading and a multi-node architecture. The results show that our approach is characterized by good accuracy and specificity; we also demonstrate its feasibility, thanks to a 84x reduction of the overall execution time with respect to a traditional sequential implementation.

AIFeb 25, 2016
Modeling cumulative biological phenomena with Suppes-Bayes Causal Networks

Daniele Ramazzotti, Alex Graudenzi, Giulio Caravagna et al.

Several diseases related to cell proliferation are characterized by the accumulation of somatic DNA changes, with respect to wildtype conditions. Cancer and HIV are two common examples of such diseases, where the mutational load in the cancerous/viral population increases over time. In these cases, selective pressures are often observed along with competition, cooperation and parasitism among distinct cellular clones. Recently, we presented a mathematical framework to model these phenomena, based on a combination of Bayesian inference and Suppes' theory of probabilistic causation, depicted in graphical structures dubbed Suppes-Bayes Causal Networks (SBCNs). SBCNs are generative probabilistic graphical models that recapitulate the potential ordering of accumulation of such DNA changes during the progression of the disease. Such models can be inferred from data by exploiting likelihood-based model-selection strategies with regularization. In this paper we discuss the theoretical foundations of our approach and we investigate in depth the influence on the model-selection task of: (i) the poset based on Suppes' theory and (ii) different regularization strategies. Furthermore, we provide an example of application of our framework to HIV genetic data highlighting the valuable insights provided by the inferred.

LGFeb 15, 2016
A Model of Selective Advantage for the Efficient Inference of Cancer Clonal Evolution

Daniele Ramazzotti

Recently, there has been a resurgence of interest in rigorous algorithms for the inference of cancer progression from genomic data. The motivations are manifold: (i) growing NGS and single cell data from cancer patients, (ii) need for novel Data Science and Machine Learning algorithms to infer models of cancer progression, and (iii) a desire to understand the temporal and heterogeneous structure of tumor to tame its progression by efficacious therapeutic intervention. This thesis presents a multi-disciplinary effort to model tumor progression involving successive accumulation of genetic alterations, each resulting populations manifesting themselves in a cancer phenotype. The framework presented in this work along with algorithms derived from it, represents a novel approach for inferring cancer progression, whose accuracy and convergence rates surpass the existing techniques. The approach derives its power from several fields including algorithms in machine learning, theory of causality and cancer biology. Furthermore, a modular pipeline to extract ensemble-level progression models from sequenced cancer genomes is proposed. The pipeline combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations and progression model inference. Furthermore, the results are validated by synthetic data with realistic generative models, and empirically interpreted in the context of real cancer datasets; in the later case, biologically significant conclusions are also highlighted. Specifically, it demonstrates the pipeline's ability to reproduce much of the knowledge on colorectal cancer, as well as to suggest novel hypotheses. Lastly, it also proves that the proposed framework can be applied to reconstruct the evolutionary history of cancer clones in single patients, as illustrated by an example from clear cell renal carcinomas.

DBOct 2, 2015
Exposing the Probabilistic Causal Structure of Discrimination

Francesco Bonchi, Sara Hajian, Bud Mishra et al.

Discrimination discovery from data is an important task aiming at identifying patterns of illegal and unethical discriminatory activities against protected-by-law groups, e.g., ethnic minorities. While any legally-valid proof of discrimination requires evidence of causality, the state-of-the-art methods are essentially correlation-based, albeit, as it is well known, correlation does not imply causation. In this paper we take a principled causal approach to the data mining problem of discrimination detection in databases. Following Suppes' probabilistic causation theory, we define a method to extract, from a dataset of historical decision records, the causal structures existing among the attributes in the data. The result is a type of constrained Bayesian network, which we dub Suppes-Bayes Causal Network (SBCN). Next, we develop a toolkit of methods based on random walks on top of the SBCN, addressing different anti-discrimination legal concepts, such as direct and indirect discrimination, group and individual discrimination, genuine requirement, and favoritism. Our experiments on real-world datasets confirm the inferential power of our approach in all these different tasks.