Unsupervised Construction of Knowledge Graphs From Text and Code
This work addresses the challenge scientists face in organizing and comparing software in the open science ecosystem, but its contribution is incremental because it builds on existing word-embedding and clustering techniques.
The authors construct knowledge graphs from scientific literature and source code, using unsupervised learning to identify conceptual entities and link them to code elements. The resulting pipeline enhances corpus-wide understanding and helps scientists build on existing models.
The scientific literature is a rich source of information for data mining with conceptual knowledge graphs; the open science movement has enriched this literature with complementary source code that implements scientific models. To exploit this new resource, we construct a knowledge graph using unsupervised learning methods to identify conceptual entities. We associate source code entities with these natural language concepts using word embedding and clustering techniques. Practical naming conventions for methods and functions tend to reflect the concept(s) they implement. We take advantage of this specificity by presenting a novel process for joint clustering of text concepts that combines word embeddings, nonlinear dimensionality reduction, and clustering techniques to assist in understanding, organizing, and comparing software in the open science ecosystem. With our pipeline, we aim to assist scientists in building on existing models in their discipline when developing novel models for new phenomena. By combining source code and conceptual information, our knowledge graph enhances corpus-wide understanding of scientific literature.
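The idea that method and function names reflect the concepts they implement can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a toy character-trigram embedding in place of learned word embeddings, and it replaces the nonlinear dimensionality reduction and joint clustering with a simple nearest-concept assignment by cosine similarity. All function names, concepts, and identifiers below are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [part.lower() for part in spaced.split()]


def embed(tokens):
    """Toy embedding (assumption): a bag of character trigrams per token.

    A real pipeline would use learned word vectors instead.
    """
    grams = Counter()
    for tok in tokens:
        padded = f"<{tok}>"
        for i in range(len(padded) - 2):
            grams[padded[i:i + 3]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_code_to_concepts(identifiers, concepts):
    """Assign each code identifier to its most similar text concept.

    Nearest-concept assignment stands in for the joint clustering step.
    """
    concept_vecs = {c: embed(c.lower().split()) for c in concepts}
    links = {}
    for ident in identifiers:
        vec = embed(split_identifier(ident))
        links[ident] = max(concept_vecs, key=lambda c: cosine(vec, concept_vecs[c]))
    return links


# Hypothetical concepts mined from text and identifiers mined from code.
concepts = ["infection rate", "population size", "recovery time"]
identifiers = ["compute_infection_rate", "getPopulationSize", "recoveryTime"]
links = link_code_to_concepts(identifiers, concepts)
print(links)
```

Each resulting pair (identifier, concept) could serve as an edge between a code node and a concept node in a knowledge graph. Trigram overlap only captures surface similarity; learned embeddings would be needed to link names that differ lexically from the concept they implement.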