SE · Feb 10, 2020

A Dataset of Enterprise-Driven Open Source Software

arXiv:2002.03927v2 · 13 citations · Has Code
AI Analysis

This dataset addresses generalizability concerns for researchers studying open source software, but the contribution is incremental because it covers only a specific subset of projects.

The authors tackled the problem of limited generalizability in open source software research by creating a dataset of 17,264 enterprise-driven GitHub projects, achieving 89% identification accuracy through heuristics based on email domains.

We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
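
The email-domain premise in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the three heuristics, so the domain lists, the function names `email_domain` and `dominant_enterprise_domain`, and the 50% threshold below are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical lists for illustration only. The paper builds its own lists
# from open data sources plus manual white- and blacklisting.
ENTERPRISE_DOMAINS = {"corp.example", "acme.example"}
BLACKLISTED_DOMAINS = {"gmail.com", "users.noreply.github.com"}


def email_domain(email):
    """Return the lower-cased domain of an email address, or None if malformed."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


def dominant_enterprise_domain(author_emails, min_share=0.5):
    """Return (domain, share) if one enterprise domain accounts for at least
    min_share of the given commit author emails, otherwise None.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    if not author_emails:
        return None
    counts = Counter(
        d
        for d in (email_domain(e) for e in author_emails)
        if d in ENTERPRISE_DOMAINS and d not in BLACKLISTED_DOMAINS
    )
    if not counts:
        return None
    domain, hits = counts.most_common(1)[0]
    share = hits / len(author_emails)
    return (domain, share) if share >= min_share else None


if __name__ == "__main__":
    commits = ["a@corp.example", "b@Corp.Example", "c@gmail.com", "bad"]
    print(dominant_enterprise_domain(commits))
```

Here a project counts as a candidate when a single listed domain covers at least half of the commit author emails. A real pipeline would also need to handle subdomains, contributors who commit under several addresses, and companies that own multiple domains.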

