AI · Jul 11, 2023 · Code
An Open-Source Knowledge Graph Ecosystem for the Life Sciences
Tiffany J. Callahan, Ignacio J. Tripodi, Adrianne L. Stefanski et al. · berkeley, harvard
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoints and abstraction algorithms), and benchmarks (e.g., prebuilt KGs and embeddings). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
AI · Jun 13, 2022
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Elena Casiraghi, Rachel Wong, Margaret Hall et al.
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and of data-related modelling choices is both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
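Multiple imputation yields several completed datasets, and the estimate obtained on each one has to be combined into a single result. The sketch below shows the standard pooling step (Rubin's rules) only; it is not the evaluation framework proposed in the paper, and the function name and the input numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, one variance per estimate")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical coefficient estimates from m = 3 imputed datasets.
est, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(f"pooled estimate = {est:.3f}, total variance = {total_var:.4f}")
```

The total variance grows with the disagreement between imputations, which is how uncertainty due to the missing values enters the final inference.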
CE · Nov 30, 2023
RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules
Emanuele Cavalleri, Alberto Cabri, Mauricio Soto-Gomez et al.
The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to the patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are continuously produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph encompassing biological knowledge about RNAs gathered from more than 50 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the biomolecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried through a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be directly explored and visualized, or analyzed by applying computational methods to infer biomedical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the biomedical problem to be studied.
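The role of a meta-graph, as described above, is to state which entity types may be connected by which interaction types. A minimal sketch of that idea follows; the type names, relation names, and identifiers are toy values chosen for illustration and are not taken from RNA-KG.

```python
# Toy meta-graph: allowed (source type, relation, target type) combinations.
# All names here are hypothetical, not the RNA-KG schema.
META_GRAPH = {
    ("miRNA", "regulates", "gene"),
    ("gene", "encodes", "protein"),
}

# Toy typing of individual nodes.
NODE_TYPES = {"miR-X": "miRNA", "GENE1": "gene", "PROT1": "protein"}


def conforms(edge, node_types, meta_graph):
    """Return True if the edge's type signature is allowed by the meta-graph."""
    subj, rel, obj = edge
    return (node_types[subj], rel, node_types[obj]) in meta_graph


edges = [
    ("miR-X", "regulates", "GENE1"),
    ("GENE1", "encodes", "PROT1"),
    ("PROT1", "regulates", "miR-X"),  # not described by the toy meta-graph
]
valid = [e for e in edges if conforms(e, NODE_TYPES, META_GRAPH)]
print(valid)
```

Checking instance edges against a type-level description like this is one way to keep a graph built from many sources semantically consistent.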
CR · Feb 5
Interpreting Manifolds and Graph Neural Embeddings from Internet of Things Traffic Flows
Enrique Feito-Casares, Francisco M. Melgarejo-Meseguer, Elena Casiraghi et al.
The rapid expansion of Internet of Things (IoT) ecosystems has led to increasingly complex and heterogeneous network topologies. Traditional network monitoring and visualization tools rely on aggregated metrics or static representations, which fail to capture the evolving relationships and structural dependencies between devices. Although Graph Neural Networks (GNNs) offer a powerful way to learn from relational data, their internal representations often remain opaque and difficult to interpret for security-critical operations. Consequently, this work introduces an interpretable pipeline that generates directly visualizable low-dimensional representations by mapping high-dimensional embeddings onto a latent manifold. This projection enables the interpretable monitoring and interoperability of evolving network states, while integrated feature attribution techniques decode the specific characteristics shaping the manifold structure. The framework achieves a classification F1-score of 0.830 for intrusion detection while also highlighting phenomena such as concept drift. Ultimately, the presented approach bridges the gap between high-dimensional GNN embeddings and human-understandable network behavior, offering new insights for network administrators and security analysts.
LG · Oct 12, 2021
GRAPE for Fast and Scalable Graph Processing and random walk-based Embedding
Luca Cappelletti, Tommaso Fontana, Elena Casiraghi et al.
Graph Representation Learning (GRL) methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE, a software resource for graph processing and embedding that can scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge and node-label prediction performance. GRAPE comprises about 1.7 million well-documented lines of Python and Rust code and provides 69 node embedding methods, 25 inference models, a collection of efficient graph processing utilities and over 80,000 graphs from the literature and other sources. Standardized interfaces allow seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of GRL methods, therefore also positioning GRAPE as a software resource to perform a fair comparison between methods and libraries for graph processing and embedding.
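Random walk-based embedding methods start by sampling node sequences from the graph, which are then fed to a sequence model. The sketch below shows only that sampling step, as a uniform first-order walk over a plain adjacency list; it does not use the GRAPE API, and the toy graph and function name are assumptions made for illustration.

```python
import random


def random_walk(adj, start, length, rng):
    """Sample a uniform random walk of at most `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Toy undirected graph as an adjacency list.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rng = random.Random(0)  # fixed seed for reproducibility
walks = [random_walk(adj, node, 5, rng) for node in adj for _ in range(2)]
for walk in walks:
    print(" -> ".join(walk))
```

A scalable implementation would store the graph in compact arrays and sample walks in parallel; this version only fixes the idea.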
LG · Jan 5, 2021
Het-node2vec: second order random walk sampling for heterogeneous multigraphs embedding
Mauricio Soto-Gomez, Peter Robinson, Carlos Cano et al.
Many real-world problems are naturally modeled as heterogeneous graphs, where nodes and edges represent multiple types of entities and relations. Existing learning models for heterogeneous graph representation usually depend on the computation of specific and user-defined heterogeneous paths, or on the application of large and often non-scalable deep neural network architectures. We propose Het-node2vec, an extension of the node2vec algorithm, designed for embedding heterogeneous graphs. Het-node2vec addresses the challenge of capturing the topological and structural characteristics of graphs and the semantic information underlying the different types of nodes and edges of heterogeneous graphs, by introducing a simple stochastic node and edge type switching strategy in second order random walk processes. The proposed approach also introduces an "attention mechanism" to focus the random walks on specific node and edge types, thus allowing more accurate embeddings and more focused predictions on specific node and edge types of interest. Empirical results on benchmark datasets show that Het-node2vec achieves comparable or superior performance with respect to state-of-the-art methods for heterogeneous graphs in node label and edge prediction tasks.
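For context, the second-order walk of the base node2vec algorithm, which Het-node2vec extends, weights each candidate next node by its relation to the previously visited node: 1/p for returning, 1 for a common neighbor, and 1/q for moving further away. The sketch below covers only this base bias on a homogeneous toy graph; the type-switching strategy and the attention mechanism of Het-node2vec are not shown, and the function names are illustrative.

```python
import random


def node2vec_weights(adj, prev, cur, p, q):
    """Unnormalized node2vec weights for each neighbor of `cur`."""
    weights = []
    for nxt in adj[cur]:
        if nxt == prev:          # return to the previous node
            weights.append(1.0 / p)
        elif nxt in adj[prev]:   # stay at distance 1 from the previous node
            weights.append(1.0)
        else:                    # move to distance 2 from the previous node
            weights.append(1.0 / q)
    return weights


def node2vec_step(adj, prev, cur, p, q, rng):
    """Sample the next node of a second-order walk."""
    weights = node2vec_weights(adj, prev, cur, p, q)
    return rng.choices(adj[cur], weights=weights, k=1)[0]


# Toy undirected graph as an adjacency list.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
weights = node2vec_weights(adj, "a", "b", 2.0, 0.5)
nxt = node2vec_step(adj, "a", "b", 2.0, 0.5, random.Random(0))
print(weights, nxt)
```

With p = 2 and q = 0.5 the walk is discouraged from backtracking and encouraged to explore outward, which is the kind of control a type-aware extension can build on.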
LG · Apr 10, 2019
Multitask Hopfield Networks
Marco Frasca, Giuliano Grossi, Giorgio Valentini
Multitask algorithms typically use task similarity information as a bias to speed up and improve the performance of learning processes. Tasks are learned jointly, sharing information across them, in order to construct models more accurate than those learned separately over single tasks. In this contribution, we present the first multitask model, to our knowledge, based on Hopfield Networks (HNs), named HoMTask. We show that by appropriately building a unique HN embedding all tasks, a more robust and effective classification model can be learned. HoMTask is a transductive semi-supervised parametric HN that minimizes an energy function extended to all nodes and to all tasks under study. We provide theoretical evidence that the optimal parameters automatically estimated by HoMTask make the model itself coherent with the prior knowledge (connection weights and node labels). The convergence properties of HNs are preserved, and the fixed point reached by the network dynamics gives rise to the prediction of unlabeled nodes. The proposed model improves the classification abilities of single-task HNs on a preliminary benchmark comparison, and achieves competitive performance with state-of-the-art semi-supervised graph-based algorithms.
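The convergence property mentioned above can be seen on a classical single-task Hopfield network: with symmetric weights and a zero diagonal, asynchronous updates never increase the energy, so the dynamics settle on a fixed point. The sketch below illustrates that behavior with bipolar states; it is not HoMTask, whose parametric and multitask formulation is not reproduced here, and the weights are toy values.

```python
def energy(w, theta, x):
    """Classical Hopfield energy: -1/2 * sum_ij w_ij x_i x_j + sum_i theta_i x_i."""
    n = len(x)
    pair = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -0.5 * pair + sum(theta[i] * x[i] for i in range(n))


def sweep(w, theta, x):
    """Update every unit once, in order; return True if any state changed."""
    changed = False
    for i in range(len(x)):
        field = sum(w[i][j] * x[j] for j in range(len(x))) - theta[i]
        new_state = 1 if field >= 0 else -1
        if new_state != x[i]:
            x[i] = new_state
            changed = True
    return changed


# Toy symmetric weights with zero diagonal, zero thresholds.
w = [[0, 1, -1], [1, 0, -1], [-1, -1, 0]]
theta = [0, 0, 0]
x = [1, -1, 1]

energies = [energy(w, theta, x)]
while sweep(w, theta, x):
    energies.append(energy(w, theta, x))
print(x, energies)
```

In a transductive setting the states of labeled nodes would be held fixed, and the fixed point reached by the remaining units gives their predicted labels.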
AI · Jun 17, 2014
Notes on hierarchical ensemble methods for DAG-structured taxonomies
Giorgio Valentini
Several real-world problems, ranging from text classification to computational biology, are characterized by hierarchical multi-label classification tasks. Most of the methods presented in the literature focus on tree-structured taxonomies, and only a few on taxonomies structured as a Directed Acyclic Graph (DAG). In this contribution, novel ensemble classification algorithms for DAG-structured taxonomies are introduced. In particular, the Hierarchical Top-Down (HTD-DAG) and True Path Rule (TPR-DAG) algorithms for DAGs are presented and discussed.
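A top-down correction on a DAG can be sketched as follows, assuming that nodes are visited parents-first and that each node's score is capped by the lowest corrected score among its parents, so that no class scores higher than any of its ancestors. This is an illustration in the spirit of a hierarchical top-down scheme, not the exact definition given in the notes; the taxonomy, scores, and function name are hypothetical. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def top_down_correct(parents, scores):
    """Cap each node's score by the minimum corrected score of its parents."""
    corrected = {}
    # static_order() yields every node after all of its predecessors (parents).
    for node in TopologicalSorter(parents).static_order():
        cap = min((corrected[p] for p in parents[node]), default=float("inf"))
        corrected[node] = min(scores[node], cap)
    return corrected


# Toy DAG taxonomy: node -> list of parents. Class "c" has two parents.
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
# Hypothetical flat classifier scores, one per class.
scores = {"root": 0.9, "a": 0.4, "b": 0.7, "c": 0.8}

corrected = top_down_correct(parents, scores)
print(corrected)
```

Here class "c" is lowered from 0.8 to 0.4 because its parent "a" scores only 0.4, which makes the set of predictions consistent with the hierarchy.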