SE · Feb 7, 2019

How Different Are Different diff Algorithms in Git?

arXiv:1902.02467v4 · 57 citations
Originality: Synthesis-oriented
AI Analysis

The paper addresses inconsistent results in code repository mining, a problem that affects both researchers and developers. Its contribution is incremental: it compares existing diff algorithms rather than introducing new ones.

The study investigated how the choice of diff algorithm in Git affects code mining tasks. It found that the choice changes code churn metrics in 1.7% to 8.2% of commits and changes the identified bug-introducing changes in 6.0% to 13.3% of bug-fix commits. The authors recommend the Histogram algorithm for better accuracy.

Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, has a diff utility, and users can choose among several diff algorithms, from the default Myers algorithm to the more advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. On the impact on code churn metrics in 14 Java projects, we obtained different values in 1.7% to 8.2% of commits depending on the diff algorithm. Regarding bug-introducing change identification, we found that 6.0% and 13.3% of the identified bug-fix commits had different results of bug-introducing changes from 10 Java projects. For patch application, our manual analysis found that Histogram is more suitable than Myers for presenting the changes to code. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to consider differences in source code.
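To make the code churn comparison concrete, the following is a minimal sketch of how one might check whether a single commit yields different churn under each of Git's diff algorithms. It is not the authors' tooling. It assumes churn means the total added and deleted lines reported by `git diff --numstat`, that the commit has a parent, and that `git` is on the PATH; the function names `parse_numstat`, `churn_by_algorithm`, and `algorithms_disagree` are hypothetical.

```python
import subprocess

# Values accepted by git's --diff-algorithm option.
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def parse_numstat(output: str) -> tuple[int, int]:
    """Sum added and deleted line counts from `git diff --numstat` output."""
    added = deleted = 0
    for line in output.splitlines():
        parts = line.split("\t")
        # Each line is "<added>\t<deleted>\t<path>"; binary files report "-".
        if len(parts) < 3 or parts[0] == "-" or parts[1] == "-":
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    return added, deleted


def churn_by_algorithm(repo: str, commit: str) -> dict[str, tuple[int, int]]:
    """Diff a commit against its first parent once per algorithm.

    Assumption: `commit` is not a root commit, so `<commit>^` resolves.
    """
    results = {}
    for algo in ALGORITHMS:
        completed = subprocess.run(
            ["git", "-C", repo, "diff", "--numstat",
             f"--diff-algorithm={algo}", f"{commit}^", commit],
            check=True, capture_output=True, text=True,
        )
        results[algo] = parse_numstat(completed.stdout)
    return results


def algorithms_disagree(results: dict[str, tuple[int, int]]) -> bool:
    """True when at least two algorithms report different churn."""
    return len(set(results.values())) > 1
```

Running `churn_by_algorithm(".", "HEAD")` inside a repository returns one `(added, deleted)` pair per algorithm, and `algorithms_disagree` flags the commits where the choice matters. To follow the recommendation for everyday use, Git also lets you set the default with `git config diff.algorithm histogram`.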

Code Implementations: 1 repo