87.8SEMay 11Code
CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ CommitsTommy Ho, Khashayar Etemadi, Zhendong Su
Recent progress in automated repair of performance bugs demands realistic, executable benchmarks. However, existing C++ performance benchmarks are largely built from competitive programming submissions, and recent real-world benchmarks predominantly target Python and .NET. To fill this gap, we present CppPerf-Mine, a configurable pipeline that mines execution-time-improving patches from open-source C++ repositories on GitHub by combining structural commit filtering, an LLM-based commit classifier, and a containerized build & test stage that produces fully reproducible Docker images for each patch. Using CppPerf-Mine, we build CppPerf-DB, a benchmark comprising 347 manually verified patches from 42 mature C++ repositories, 39% of which are multi-file, enabling the evaluation of repository-level repair tools. In our preliminary study, OpenHands correctly fixes only 13.5% of the patches in CppPerf-DB, confirming that real-world C++ performance repair remains an open challenge. CppPerf-Mine and CppPerf-DB are open-source and publicly available at: https://doi.org/10.5281/zenodo.20097425. In addition, a demonstration video is available at: https://www.youtube.com/watch?v=nixlupIgSdM.
SEFeb 9, 2024
CigaR: Cost-efficient Program Repair with LLMsDávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla et al.
Large language models (LLM) have proven to be effective at automated program repair (APR). However, using LLMs can be costly, with companies invoicing users by the number of tokens. In this paper, we propose CigaR, the first LLM-based APR tool that focuses on minimizing the repair cost. CigaR works in two major steps: generating a first plausible patch and multiplying plausible patches. CigaR optimizes the prompts and the prompt setting to maximize the information given to LLMs using the smallest possible number of tokens. Our experiments on 429 bugs from the widely used Defects4J and HumanEval-Java datasets shows that CigaR reduces the token cost by 73%. On average, CigaR spends 127k tokens per bug while the baseline uses 467k tokens per bug. On the subset of bugs that are fixed by both, CigaR spends 20k per bug while the baseline uses 608k tokens, a cost saving of 96%. Our extensive experiments show that CigaR is a cost-effective LLM-based program repair tool that uses a low number of tokens to automatically generate patches.
SEJan 31, 2024
Generative AI to Generate Test Data GeneratorsBenoit Baudry, Khashayar Etemadi, Sen Fang et al.
Generating fake data is an essential dimension of modern software testing, as demonstrated by the number and significance of data faking libraries. Yet, developers of faking libraries cannot keep up with the wide range of data to be generated for different natural languages and domains. In this paper, we assess the ability of generative AI for generating test data in different domains. We design three types of prompts for Large Language Models (LLMs), which perform test data generation tasks at different levels of integrability: 1) raw test data generation, 2) synthesizing programs in a specific language that generate useful test data, and 3) producing programs that use state-of-the-art faker libraries. We evaluate our approach by prompting LLMs to generate test data for 11 domains. The results show that LLMs can successfully generate realistic test data generators in a wide range of domains at all three levels of integrability.
SEMar 22, 2021
Sorald: Automatic Patch Suggestions for SonarQube Static Analysis ViolationsKhashayar Etemadi, Nicolas Harrand, Simon Larsen et al.
Previous work has shown that early resolution of issues detected by static code analyzers can prevent major costs later on. However, developers often ignore such issues for two main reasons. First, many issues should be interpreted to determine if they correspond to actual flaws in the program. Second, static analyzers often do not present the issues in a way that is actionable. To address these problems, we present Sorald: a novel system that devise metaprogramming templates to transform the abstract syntax trees of programs and suggest fixes for static analysis warnings. Thus, the burden on the developer is reduced from interpreting and fixing static issues, to inspecting and approving full fledged solutions. Sorald fixes violations of 10 rules from SonarJava, one of the most widely used static analyzers for Java. We evaluate Sorald on a dataset of 161 popular repositories on Github. Our analysis shows the effectiveness of Sorald as it fixes 65% (852/1,307) of the violations that meets the repair preconditions. Overall, our experiments show it is possible to automatically fix notable violations of the static analysis rules produced by the state-of-the-art static analyzer SonarJava.
SEDec 12, 2020
A Software-Repair Robot based on Continual LearningBenoit Baudry, Zimin Chen, Khashayar Etemadi et al.
Software bugs are common and correcting them accounts for a significant part of costs in the software development and maintenance process. This calls for automatic techniques to deal with them. One promising direction towards this goal is gaining repair knowledge from historical bug fixing examples. Retrieving insights from software development history is particularly appealing with the constant progress of machine learning paradigms and skyrocketing `big' bug fixing data generated through Continuous Integration (CI). In this paper, we present R-Hero, a novel software repair bot that applies continual learning to acquire bug fixing strategies from continuous streams of source code changes, implemented for the single development platform Github/Travis CI. We describe R-Hero, our novel system for learning how to fix bugs based on continual training, and we uncover initial successes as well as novel research challenges for the community.
SEOct 5, 2020
On the Relevance of Cross-project Learning with Nearest Neighbours for Commit Message GenerationKhashayar Etemadi, Martin Monperrus
Commit messages play an important role in software maintenance and evolution. Nonetheless, developers often do not produce high-quality messages. A number of commit message generation methods have been proposed in recent years to address this problem. Some of these methods are based on neural machine translation (NMT) techniques. Studies show that the nearest neighbor algorithm (NNGen) outperforms existing NMT-based methods, although NNGen is simpler and faster than NMT. In this paper, we show that NNGen does not take advantage of cross-project learning in the majority of the cases. We also show that there is an even simpler and faster variation of the existing NNGen method which outperforms it in terms of the BLEU_4 score without using cross-project learning.
SEJul 14, 2020
Estimating the Potential of Program Repair Search Spaces with Commit AnalysisKhashayar Etemadi, Niloofar Tarighat, Siddharth Yadav et al.
The most natural method for evaluating program repair systems is to run them on bug datasets, such as Defects4J. Yet, using this evaluation technique on arbitrary real-world programs requires heavy configuration. In this paper, we propose a purely static method to evaluate the potential of the search space of repair approaches. This new method enables researchers and practitioners to encode the search spaces of repair approaches and select potentially useful ones without struggling with tool configuration and execution. We encode the search spaces by specifying the repair strategies they employ. Next, we use the specifications to check whether past commits lie in repair search spaces. For a repair approach, including many human-written past commits in its search space indicates its potential to generate useful patches. We implement our evaluation method in LighteR. LighteR gets a Git repository and outputs a list of commits whose source code changes lie in repair search spaces. We run LighteR on 55,309 commits from the history of 72 Github repositories with and show that LighteR's precision and recall are 77% and 92%, respectively. Overall, our experiments show that our novel method is both lightweight and effective to study the search space of program repair approaches.
SEAug 18, 2019
Incorporating fault-proneness estimations into coverage-based test case prioritization methodsMostafa Mahdieh, Seyed-Hassan Mirian-Hosseinabadi, Khashayar Etemadi et al.
Context: During the development process of a software program, regression testing is used to ensure that the correct behavior of the software is retained after updates to the source code. This regression testing becomes costly over time as the number of test cases increases and it makes sense to prioritize test cases in order to execute fault-detecting test cases as soon as possible. There are many coverage-based test case prioritization (TCP) methods that only use the code coverage data to prioritize test cases. By incorporating the fault-proneness estimations of code units into the coverage-based TCP methods, we can improve such techniques. Objective: In this paper, we aim to propose an approach which improves coverage-based TCP methods by considering the fault-proneness distribution over code units. Further, we present the results of an empirical study that shows using our proposed approach significantly improves the additional strategy, which is a widely used coverage-based TCP method. Method: The approach presented in this study uses the bug history of the software in order to introduce a defect prediction method to learn a neural network model. This model is then used to estimate fault-proneness of each area of the source code and then the estimations are incorporated into coverage-based TCP methods. Our proposed approach is a general idea that can be applied to many coverage-based methods, such as the additional and total TCP methods. Results: The proposed methods are evaluated on datasets collected from the development history of five real-world projects including 357 versions in total. The experiments show that using an appropriate bug history can improve coverage-based TCP methods.