The Java Build Framework: Large Scale Compilation
This work addresses the need of software engineering and static analysis researchers for reliable, scalable datasets, so that studies can generalize beyond small benchmarks.
The authors address the lack of large compilable and runnable Java datasets for research by developing the Java Build Framework, which automatically compiles a large percentage of the Java projects in open-source repositories such as GitHub.
Large repositories of source code for research tend to be useful only for static analysis, because they give no guarantee that the projects are compilable, much less runnable. As an immediate consequence of the lack of large compilable and runnable datasets, research that requires such properties does not generalize beyond small benchmarks. We present the Java Build Framework, a method and tool capable of automatically compiling a large percentage of the Java projects available in open-source repositories like GitHub. Two elements are at its core: a very large repository of JAR files, and techniques for resolving compilation faults and dependencies.
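To make the two core elements more concrete, here is a minimal sketch of how an index over a JAR repository could be used to resolve unresolved imports. It is an illustration under stated assumptions, not the framework's actual implementation: the class `JarIndexSketch`, its methods, the JAR names, and the greedy selection rule are all hypothetical and do not come from the abstract.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: maps fully qualified class names to the JARs that provide them.
public class JarIndexSketch {
    // Assumed index structure: class name -> JARs containing that class, in insertion order.
    private final Map<String, List<String>> index = new HashMap<>();

    // Registers a JAR and the classes it contains.
    public void addJar(String jarName, Collection<String> classNames) {
        for (String className : classNames) {
            index.computeIfAbsent(className, k -> new ArrayList<>()).add(jarName);
        }
    }

    // Greedy selection (an assumption, not the authors' algorithm):
    // for each unresolved import, reuse an already chosen JAR if it provides the class,
    // otherwise pick the first indexed JAR that does. Unknown imports are skipped.
    public Set<String> resolve(Collection<String> missingImports) {
        Set<String> chosen = new LinkedHashSet<>();
        for (String imp : missingImports) {
            List<String> candidates = index.getOrDefault(imp, Collections.emptyList());
            if (candidates.isEmpty()) {
                continue; // no JAR in the index provides this class
            }
            boolean covered = false;
            for (String candidate : candidates) {
                if (chosen.contains(candidate)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                chosen.add(candidates.get(0));
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        JarIndexSketch idx = new JarIndexSketch();
        // Illustrative JAR names and contents only.
        idx.addJar("lib-a.jar", Arrays.asList("org.example.a.Foo", "org.example.a.Bar"));
        idx.addJar("lib-b.jar", Arrays.asList("org.example.b.Baz"));
        Set<String> jars = idx.resolve(
                Arrays.asList("org.example.a.Foo", "org.example.a.Bar", "org.example.b.Baz"));
        System.out.println("Classpath candidates: " + jars);
    }
}
```

In a real pipeline, the selected JARs would be placed on the compiler classpath and the build retried. Any remaining compilation faults would feed further resolution steps, which this sketch does not model.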