SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem
For researchers studying the scholarly ecosystem, SemRepo provides unified infrastructure for analyzing software sustainability and reproducibility across repositories and publications, although its contribution as a resource is incremental.
SemRepo is an RDF knowledge graph with over 81 million triples describing nearly 200,000 GitHub repositories linked to scholarly knowledge graphs, enabling cross-platform queries and analyses of research software and its scholarly context.
We present SemRepo, an RDF knowledge graph comprising over 81 million triples that describe nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in Linked Papers With Code (LPWC), and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span repositories, publications, and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications and the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
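To illustrate the kind of cross-graph query this linking enables, the following is a minimal sketch in Python over a toy in-memory set of triples. Every identifier and predicate name in it (`hasContributor`, `implements`, `sameAs`, `isArchived`, `usesDataset`, and all `repo:`, `user:`, `author:`, `paper:`, and `dataset:` values) is a hypothetical placeholder chosen for illustration; it is not SemRepo's actual vocabulary, data, or query interface.

```python
# Toy triples standing in for a linked graph. All names are hypothetical
# placeholders, not SemRepo's real schema or data.
TRIPLES = [
    ("repo:alpha", "hasContributor", "user:ann"),
    ("repo:alpha", "implements", "paper:p1"),
    ("repo:alpha", "isArchived", "true"),
    ("repo:beta", "hasContributor", "user:bob"),
    ("repo:beta", "implements", "paper:p2"),
    ("repo:beta", "isArchived", "false"),
    ("user:ann", "sameAs", "author:A1"),   # link to a scholarly author profile
    ("user:bob", "sameAs", "author:A2"),
    ("paper:p1", "usesDataset", "dataset:d1"),
]


def objects(subject, predicate):
    """Return all objects o such that (subject, predicate, o) is in TRIPLES."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return all subjects s such that (s, predicate, obj) is in TRIPLES."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def archived_repos_with_papers():
    """Pair each archived repository with the papers it implements.

    An archived repository behind a publication is used here as a simple,
    assumed signal of sustainability risk.
    """
    return [
        (s, o)
        for s, p, o in TRIPLES
        if p == "implements" and "true" in objects(s, "isArchived")
    ]


def authors_of_paper(paper):
    """Follow paper <- repository -> contributor -> author profile."""
    authors = set()
    for repo in subjects("implements", paper):
        for user in objects(repo, "hasContributor"):
            authors.update(objects(user, "sameAs"))
    return sorted(authors)


if __name__ == "__main__":
    print(archived_repos_with_papers())
    print(authors_of_paper("paper:p1"))
```

Against the actual resource, the same joins would presumably be written as SPARQL graph patterns over the published vocabulary; the sketch only shows how a single traversal can cross from repository metadata to author profiles and publications once the links exist.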