SEJan 17, 2019Code
Bears: An Extensible Java Bug Benchmark for Automatic Program Repair StudiesFernanda Madeiral, Simon Urli, Marcelo Maia et al.
Benchmarks of bugs are essential to empirically evaluate automatic program repair tools. In this paper, we present Bears, a project for collecting and storing bugs into an extensible bug benchmark for automatic repair studies in Java. The collection of bugs relies on commit building state from Continuous Integration (CI) to find potential pairs of buggy and patched program versions from open-source projects hosted on GitHub. Each pair of program versions passes through a pipeline where an attempt of reproducing a bug and its patch is performed. The core step of the reproduction pipeline is the execution of the test suite of the program on both program versions. If a test failure is found in the buggy program version candidate and no test failure is found in its patched program version candidate, a bug and its patch were successfully reproduced. The uniqueness of Bears is the usage of CI (builds) to identify buggy and patched program version candidates, which has been widely adopted in the last years in open-source projects. This approach allows us to collect bugs from a diversity of projects beyond mature projects that use bug tracking systems. Moreover, Bears was designed to be publicly available and to be easily extensible by the research community through automatic creation of branches with bugs in a given GitHub repository, which can be used for pull requests in the Bears repository. We present in this paper the approach employed by Bears, and we deliver the version 1.0 of Bears, which contains 251 reproducible bugs collected from 72 projects that use the Travis CI and Maven build environment.
SEOct 2, 2021
Evaluating Code Readability and Legibility: An Examination of Human-centric StudiesDelano Oliveira, Reydne Bruno, Fernanda Madeiral et al.
Reading code is an essential activity in software maintenance and evolution. Several studies with human subjects have investigated how different factors, such as the employed programming constructs and naming conventions, can impact code readability, i.e., what makes a program easier or harder to read and apprehend by developers, and code legibility, i.e., what influences the ease of identifying elements of a program. These studies evaluate readability and legibility by means of different comprehension tasks and response variables. In this paper, we examine these tasks and variables in studies that compare programming constructs, coding idioms, naming conventions, and formatting guidelines, e.g., recursive vs. iterative code. To that end, we have conducted a systematic literature review where we found 54 relevant papers. Most of these studies evaluate code readability and legibility by measuring the correctness of the subjects' results (83.3%) or simply asking their opinions (55.6%). Some studies (16.7%) rely exclusively on the latter variable.There are still few studies that monitor subjects' physical signs, such as brain activation regions (5%). Moreover, our study shows that some variables are multi-faceted. For instance, correctness can be measured as the ability to predict the output of a program, answer questions about its behavior, or recall parts of it. These results make it clear that different evaluation approaches require different competencies from subjects, e.g., tracing the program vs. summarizing its goal vs. memorizing its text. To assist researchers in the design of new studies and improve our comprehension of existing ones, we model program comprehension as a learning activity by adapting a preexisting learning taxonomy. This adaptation indicates that some competencies are often exercised in these evaluations whereas others are rarely targeted.
SEAug 10, 2021
Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff SizeMartin Monperrus, Matias Martinez, He Ye et al.
This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.
SEApr 6, 2021
A large-scale study on human-cloned changes for automated program repairFernanda Madeiral, Thomas Durieux
Research in automatic program repair has shown that real bugs can be automatically fixed. However, there are several challenges involved in such a task that are not yet fully addressed. As an example, consider that a test-suite-based repair tool performs a change in a program to fix a bug spotted by a failing test case, but then the same or another test case fails. This could mean that the change is a partial fix for the bug or that another bug was manifested. However, the repair tool discards the change and possibly performs other repair attempts. One might wonder if the applied change should be also applied in other locations in the program so that the bug is fully fixed. In this paper, we are interested in investigating the extent of bug fix changes being cloned by developers within patches. Our goal is to investigate the need of multi-location repair by using identical or similar changes in identical or similar contexts. To do so, we analyzed 3,049 multi-hunk patches from the ManySStuBs4J dataset, which is a large dataset of single statement bug fix changes. We found out that 68% of the multi-hunk patches contain at least one change clone group. Moreover, most of these patches (70%) are strictly-cloned ones, which are patches fully composed of changes belonging to one single change clone group. Finally, most of the strictly-cloned patches (89%) contain change clones with identical changes, independently of their contexts. We conclude that automated solutions for creating patches composed of identical or similar changes can be useful for fixing bugs.
SEMar 22, 2021
Sorald: Automatic Patch Suggestions for SonarQube Static Analysis ViolationsKhashayar Etemadi, Nicolas Harrand, Simon Larsen et al.
Previous work has shown that early resolution of issues detected by static code analyzers can prevent major costs later on. However, developers often ignore such issues for two main reasons. First, many issues should be interpreted to determine if they correspond to actual flaws in the program. Second, static analyzers often do not present the issues in a way that is actionable. To address these problems, we present Sorald: a novel system that devise metaprogramming templates to transform the abstract syntax trees of programs and suggest fixes for static analysis warnings. Thus, the burden on the developer is reduced from interpreting and fixing static issues, to inspecting and approving full fledged solutions. Sorald fixes violations of 10 rules from SonarJava, one of the most widely used static analyzers for Java. We evaluate Sorald on a dataset of 161 popular repositories on Github. Our analysis shows the effectiveness of Sorald as it fixes 65% (852/1,307) of the violations that meets the repair preconditions. Overall, our experiments show it is possible to automatically fix notable violations of the static analysis rules produced by the state-of-the-art static analyzer SonarJava.
SEMay 28, 2019
Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair AttemptsThomas Durieux, Fernanda Madeiral, Matias Martinez et al.
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their techniques and tools.
SEApr 3, 2019
Styler: learning formatting conventions to repair Checkstyle violationsBenjamin Loriot, Fernanda Madeiral, Martin Monperrus
Ensuring the consistent usage of formatting conventions is an important aspect of modern software quality assurance. While formatting convention violations can be automatically detected by format checkers implemented in linters, there is no satisfactory solution for repairing them. Manually fixing formatting convention violations is a waste of developer time and code formatters do not take into account the conventions adopted and configured by developers for the used linter. In this paper, we present Styler, a tool dedicated to fixing formatting rule violations raised by format checkers using a machine learning approach. For a given project, Styler first generates training data by injecting violations of the project-specific rules in violation-free source code files. Then, it learns fixes by feeding long short-term memory neural networks with the training data encoded into token sequences. Finally, it predicts fixes for real formatting violations with the trained models. Currently, Styler supports a single checker, Checkstyle, which is a highly configurable and popular format checker for Java. In an empirical evaluation, Styler repaired 41% of 26,791 Checkstyle violations mined from 104 GitHub projects. Moreover, we compared Styler with the IntelliJ plugin CheckStyle-IDEA and the machine-learning-based code formatters Naturalize and CodeBuff. We found out that Styler fixes violations of a diverse set of Checkstyle rules (24/25 rules), generates smaller repairs in comparison to the other systems, and predicts repairs in seconds once trained on a project. Through a manual analysis, we identified cases in which Styler does not succeed to generate correct repairs, which can guide further improvements in Styler. Finally, the results suggest that Styler can be useful to help developers repair Checkstyle formatting violations.
SEMar 21, 2019
Bootstrapping Cookbooks for APIs from Crowd Knowledge on Stack OverflowLucas B. L. Souza, Eduardo C. Campos, Fernanda Madeiral et al.
Well established libraries typically have API documentation. However, they frequently lack examples and explanations, possibly making difficult their effective reuse. Stack Overflow is a question-and-answer website oriented to issues related to software development. Despite the increasing adoption of Stack Overflow, the information related to a particular topic (e.g., an API) is spread across the website. Thus, Stack Overflow still lacks organization of the crowd knowledge available on it. Our target goal is to address the problem of the poor quality documentation for APIs by providing an alternative artifact to document them based on the crowd knowledge available on Stack Overflow, called crowd cookbook. A cookbook is a recipe-oriented book, and we refer to our cookbook as crowd cookbook since it contains content generated by a crowd. The cookbooks are meant to be used through an exploration process, i.e. browsing. In this paper, we present a semi-automatic approach that organizes the crowd knowledge available on Stack Overflow to build cookbooks for APIs. We have generated cookbooks for three APIs widely used by the software development community: SWT, LINQ and QT. We have also defined desired properties that crowd cookbooks must meet, and we conducted an evaluation of the cookbooks against these properties with human subjects. The results showed that the cookbooks built using our approach, in general, meet those properties. As a highlight, most of the recipes were considered appropriate to be in the cookbooks and have self-contained information. We concluded that our approach is capable to produce adequate cookbooks automatically, which can be as useful as manually produced cookbooks. This opens an opportunity for API designers to enrich existent cookbooks with the different points of view from the crowd, or even to generate initial versions of new cookbooks.
SEJul 30, 2018
Towards an automated approach for bug fix pattern detectionFernanda Madeiral, Thomas Durieux, Victor Sobreira et al.
The characterization of bug datasets is essential to support the evaluation of automatic program repair tools. In a previous work, we manually studied almost 400 human-written patches (bug fixes) from the Defects4J dataset and annotated them with properties, such as repair patterns. However, manually finding these patterns in different datasets is tedious and time-consuming. To address this activity, we designed and implemented PPD, a detector of repair patterns in patches, which performs source code change analysis at abstract-syntax tree level. In this paper, we report on PPD and its evaluation on Defects4J, where we compare the results from the automated detection with the results from the previous manual analysis. We found that PPD has overall precision of 91% and overall recall of 92%, and we conclude that PPD has the potential to detect as many repair patterns as human manual analysis.
SEJan 19, 2018
Dissection of a Bug Dataset: Anatomy of 395 Patches from Defects4JVictor Sobreira, Thomas Durieux, Fernanda Madeiral et al.
Well-designed and publicly available datasets of bugs are an invaluable asset to advance research fields such as fault localization and program repair as they allow directly and fairly comparison between competing techniques and also the replication of experiments. These datasets need to be deeply understood by researchers: the answer for questions like "which bugs can my technique handle?" and "for which bugs is my technique effective?" depends on the comprehension of properties related to bugs and their patches. However, such properties are usually not included in the datasets, and there is still no widely adopted methodology for characterizing bugs and patches. In this work, we deeply study 395 patches of the Defects4J dataset. Quantitative properties (patch size and spreading) were automatically extracted, whereas qualitative ones (repair actions and patterns) were manually extracted using a thematic analysis-based approach. We found that 1) the median size of Defects4J patches is four lines, and almost 30% of the patches contain only addition of lines; 2) 92% of the patches change only one file, and 38% has no spreading at all; 3) the top-3 most applied repair actions are addition of method calls, conditionals, and assignments, occurring in 77% of the patches; and 4) nine repair patterns were found for 95% of the patches, where the most prevalent, appearing in 43% of the patches, is on conditional blocks. These results are useful for researchers to perform advanced analysis on their techniques' results based on Defects4J. Moreover, our set of properties can be used to characterize and compare different bug datasets.