Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale
This work addresses the problem of tracking provenance for all publicly available source code, which is crucial for developers and researchers. The approach is incremental, however, in that it builds on existing data models.
The study quantifies the exponential growth of original source code files and commits over more than 40 years in the Software Heritage archive, observes a combinatorial explosion in the duplication of identical files across different contexts, and benchmarks data models for tracking provenance at scale.
We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how often the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implications of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
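To make the two measured quantities concrete, the following is a minimal sketch on toy data, not the method used in the study. It assumes a hypothetical in-memory mapping from commit identifiers to a timestamp and a set of file content hashes. It also assumes one simple reading of the multiplication factor: the average number of commits in which each unique file appears. All names and data are illustrative.

```python
from collections import defaultdict

# Toy data (hypothetical): commit id -> (year, set of file content hashes).
commits = {
    "c1": (2001, {"a", "b"}),
    "c2": (2002, {"a", "b", "c"}),
    "c3": (2003, {"a", "c", "d"}),
}


def first_seen(commits):
    """Map each file hash to the earliest year in which it appears."""
    earliest = {}
    for year, blobs in commits.values():
        for blob in blobs:
            if blob not in earliest or year < earliest[blob]:
                earliest[blob] = year
    return earliest


def originals_per_year(commits):
    """Count never-seen-before files per year (the growth of original content)."""
    counts = defaultdict(int)
    for year in first_seen(commits).values():
        counts[year] += 1
    return dict(counts)


def multiplication_factor(commits):
    """Assumed definition: (file, commit) occurrences divided by unique files."""
    occurrences = sum(len(blobs) for _, blobs in commits.values())
    unique = len(first_seen(commits))
    return occurrences / unique


print(originals_per_year(commits))
print(multiplication_factor(commits))
```

In this toy corpus, four unique files account for eight file-in-commit occurrences. A provenance index that records every occurrence therefore grows faster than the set of unique files, which is the effect the benchmarked data models have to absorb at archive scale.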