SE · Aug 10, 2021

Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size

arXiv:2108.04631v1 · 19 citations
Originality: Synthesis-oriented
AI Analysis

Megadiff provides a large-scale dataset for researchers in software engineering and machine learning on code, but the contribution is incremental: it covers only Java and applies specific inclusion criteria.

The paper introduces Megadiff, a dataset of 663,029 Java source code diffs categorized by diff size, designed to support research in areas such as commit comprehension and automated program repair.

This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663,029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.

Code Implementations: 1 repo
