SE | Dec 7, 2020

A Tool to Extract Structured Data from GitHub

arXiv:2012.03453v1 | 6 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This tool addresses the lack of a systematic dataset of open-source project information on GitHub, which is useful for researchers and developers interested in knowledge acquisition and mining of repository data.

The authors developed GitRepository, a tool that extracts structured data from GitHub repositories. From an initial 1680 repositories, the tool generated a dataset of 247 repositories after applying all pre-defined filters, saving the data as CSV and DB files.

GitHub repositories contain detailed information about a project: its contributors, the number of commits and their contributors, releases, pull requests, programming languages, and issues. However, no systematic dataset of open-source projects exists that features detailed information about GitHub repositories for knowledge acquisition and mining. In this paper, we developed a tool, named GitRepository, that helps create a dataset of repositories based on the proposed schema. Out of an initial 1680 repositories, the dataset hosts 620 repositories (after applying basic filters on stars and forks) and 247 repositories (after applying all pre-defined filters). The tool extracts information from GitHub repositories and saves the data as CSV files and a database (.DB) file.
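The filter-then-save flow described in the abstract can be illustrated with a minimal sketch. This is not the GitRepository implementation. The sample records, the field names (`full_name`, `stars`, `forks`, `language`), the threshold values, and the helper names are all hypothetical, and SQLite is assumed as the database format only because the abstract mentions a .DB file. A real run would obtain the records from the GitHub API and follow the schema proposed in the paper.

```python
import csv
import sqlite3

# Hypothetical sample records; field names are assumptions, not the paper's schema.
REPOS = [
    {"full_name": "example/alpha", "stars": 1200, "forks": 300, "language": "Python"},
    {"full_name": "example/beta", "stars": 15, "forks": 2, "language": "Java"},
    {"full_name": "example/gamma", "stars": 800, "forks": 40, "language": "Go"},
]

FIELDS = ["full_name", "stars", "forks", "language"]


def apply_basic_filters(repos, min_stars, min_forks):
    """Keep repositories that meet both thresholds (threshold values are assumptions)."""
    return [r for r in repos if r["stars"] >= min_stars and r["forks"] >= min_forks]


def save_csv(repos, path):
    """Write the filtered records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(repos)


def save_db(repos, path):
    """Write the filtered records to an SQLite database file."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS repositories ("
                "full_name TEXT PRIMARY KEY, stars INTEGER, "
                "forks INTEGER, language TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO repositories "
                "VALUES (:full_name, :stars, :forks, :language)",
                repos,
            )
    finally:
        conn.close()


if __name__ == "__main__":
    kept = apply_basic_filters(REPOS, min_stars=100, min_forks=50)
    save_csv(kept, "repositories.csv")
    save_db(kept, "repositories.db")
```

Keeping the filter step separate from the two writers means the same filtered list feeds both output formats, so the CSV and the database stay consistent with each other.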
