SE | Mar 11 | Code
Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques
David Williams, Ioakim Avraam, Aldeida Aleti et al.
Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. To this end, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.
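To make the random-sampling comparison concrete, here is a minimal sketch of one plausible way to compare a POD tool against random selection of the same number of patches. It is not the paper's benchmarking methodology; the function name `compare_with_random`, the data layout, and the toy numbers are all assumptions made for illustration.

```python
def compare_with_random(labels, kept_by_tool):
    """Compare a POD tool's selection with random selection of equal size.

    labels: dict mapping patch id -> True if the patch is correct (hypothetical layout).
    kept_by_tool: ids of the patches the tool predicts to be correct.
    Returns (correct patches kept by the tool,
             expected correct patches when picking the same number at random).
    """
    k = len(kept_by_tool)
    tool_correct = sum(1 for patch in kept_by_tool if labels[patch])
    base_rate = sum(labels.values()) / len(labels)  # share of correct patches overall
    return tool_correct, k * base_rate


# Toy data: 10 plausible patches, of which only patches 0 and 1 are correct.
labels = {patch_id: patch_id < 2 for patch_id in range(10)}
tool_correct, random_expected = compare_with_random(labels, [0, 5, 6, 7])
print(tool_correct, random_expected)
```

In this toy case the tool keeps one correct patch among four, while random selection of four patches would be expected to keep 0.8 correct patches, so the tool is only marginally better than chance.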
LG | Feb 2, 2023
Energy Efficiency of Training Neural Network Architectures: An Empirical Study
Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez et al.
The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high computational power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact in terms of energy consumed and CO$_2$ emissions produced during training by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture and the location where the computations are hosted on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the difference between software-based and hardware-based tools for measuring the energy consumed.
SE | Aug 2, 2024 | Code
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
Matias Martinez
The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. The speed of inference, or performance, of the LLM, is critical for real-time applications, as it computes millions or billions of floating point operations per inference. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
SE | Jun 20, 2025 | Code
Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
Matias Martinez, Xavier Franch
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
SE | Mar 17, 2021 | Code
Learning migration models for supporting incremental language migrations of software applications
Bruno Góis Mateus, Matias Martinez, Christophe Kolski
Context: A legacy system can be defined as a system that significantly resists modification and evolution. According to the literature, there are two main strategies to migrate a legacy system: (a) replacing the legacy system with a new one, or (b) incrementally migrating parts of the legacy system to the new one. Incremental migration allows developers to better control the risks that may occur during the migration process. However, this strategy is more complex because it requires decomposing the legacy system into different parts, e.g., a set of files, and defining the order in which they are migrated during the migration process. To our knowledge, there is no approach to support developers in those activities. Objective: This paper presents an approach, named MigrationExp, to support incremental language migrations of applications from one source language to another target language. MigrationExp recommends the files that should be migrated first in a particular migration iteration. As a novelty, our approach relies on a ranking model learned, using a learning-to-rank algorithm, from migrations made by developers. Method: We validate our approach in the context of the migrations of Android apps, from Java to Kotlin, a new official language for Android. We train our model using migrations of Java code to Kotlin written by developers on open-source applications. Results: The results show that, on the task of proposing files to migrate, our approach outperforms a previous migration strategy proposed by Google, in terms of its ability to accurately predict empirically observed migration orders. Conclusion: Since most Android applications are written in Java, we conclude that approaches to support developers such as MigrationExp may significantly impact the development of Android applications.
SE | Mar 28, 2020 | Code
Why did developers migrate Android applications from Java to Kotlin?
Matias Martinez, Bruno Gois Mateus
Currently, the majority of apps running on mobile devices are Android apps developed in Java. However, developers can now write Android applications using a new programming language: Kotlin, which Google adopted in 2017 as an official programming language for developing Android apps. Since then, Android developers have been able to: a) start writing Android applications from scratch using Kotlin, b) evolve their existing Android applications written in Java by adding Kotlin code (possible thanks to the interoperability between the two languages), or c) migrate their Android apps from Java to Kotlin. This paper aims to study this last case. We conducted a qualitative study to find out why Android developers have migrated Java code to Kotlin and to bring together their experiences about the process, in order to identify the main difficulties they have faced. To execute the study, we first identified commits from open-source Android projects that have migrated Java code to Kotlin. Then, we emailed the developers that wrote those migrations. We thus obtained information from 98 developers who had migrated code from Java to Kotlin. This paper presents the main reasons identified by the study for performing the migration. We found that developers migrated Java code to Kotlin in order to access programming language features (e.g., extension functions, lambdas, smart casts) that are not available with Java for Android development, and to obtain safer code (i.e., avoid null-pointer exceptions). We also identified research directions that the research community could focus on in order to help developers to improve the experience of migrating their Java applications to Kotlin.
SE | Dec 16, 2019 | Code
RTj: a Java framework for detecting and refactoring rotten green test cases
Matias Martinez, Anne Etien, Stéphane Ducasse et al.
Rotten green tests are passing tests that have at least one assertion that is not executed. They give developers false confidence. In this paper, we present RTj, a framework that analyzes test cases from Java projects with the goal of detecting and refactoring rotten test cases. RTj automatically discovered 427 rotten tests in 26 open-source Java projects hosted on GitHub. Using RTj, developers obtain automated recommendations of the tests that need to be modified to improve the quality of the applications under test.
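The notion of a rotten green test can be illustrated with a small Python sketch. RTj itself targets Java, and this is not its implementation; the helper `check`, the toy function `find_overdue_invoices`, and the detector `is_rotten_green` are hypothetical names. The detector is also simplified: it only flags the case where no assertion at all runs, whereas the definition above covers any test with at least one unexecuted assertion.

```python
# Records the label of every assertion that actually runs.
executed_assertions = []


def check(condition, label):
    """Assertion helper that logs its own execution before asserting."""
    executed_assertions.append(label)
    assert condition, label


def find_overdue_invoices(invoices):
    # Deliberately buggy: always returns an empty list.
    return []


def test_overdue_invoices_are_flagged():
    invoices = [{"id": 1, "days_late": 40}]
    for invoice in find_overdue_invoices(invoices):
        # Never reached: the loop body does not run on an empty result.
        check(invoice["days_late"] > 30, "overdue invoice is late")


def is_rotten_green(test):
    """Return True if the test passes without executing any assertion."""
    executed_assertions.clear()
    try:
        test()
    except AssertionError:
        return False  # a failing test is not green
    return len(executed_assertions) == 0


print(is_rotten_green(test_overdue_invoices_are_flagged))  # True
```

The test passes even though the function under test is broken, because its only assertion sits inside a loop that never iterates.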
SE | Oct 11, 2019 | Code
Repairnator patches programs automatically
Martin Monperrus, Simon Urli, Thomas Durieux et al.
Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce patches that were accepted by the human developers and permanently merged into the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair.
SE | Nov 10, 2018 | Code
Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs
Jifeng Xuan, Matias Martinez, Favio Demarco et al.
We propose NOPOL, an approach to automatic repair of buggy conditional statements (i.e., if-then-else statements). This approach takes a buggy program as well as a test suite as input and generates a patch with a conditional expression as output. The test suite is required to contain passing test cases to model the expected behavior of the program and at least one failing test case that reveals the bug to be repaired. The process of NOPOL consists of three major phases. First, NOPOL employs angelic fix localization to identify expected values of a condition during the test execution. Second, runtime trace collection is used to collect variables and their actual values, including primitive data types and object-oriented features (e.g., nullness checks), to serve as building blocks for patch generation. Third, NOPOL encodes these collected data into an instance of a Satisfiability Modulo Theory (SMT) problem, then a feasible solution to the SMT instance is translated back into a code patch. We evaluate NOPOL on 22 real-world bugs (16 bugs with buggy IF conditions and 6 bugs with missing preconditions) on two large open-source projects, namely Apache Commons Math and Apache Commons Lang. Empirical analysis on these bugs shows that our approach can effectively fix bugs with buggy IF conditions and missing preconditions. We illustrate the capabilities and limitations of NOPOL using case studies of real bug fixes.
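The first phase, angelic fix localization, can be sketched in a few lines. This is an illustrative Python toy, not NOPOL's Java implementation: the `angelic` parameter, the function names, and the example bug are assumptions, and forcing one constant value for the whole test run is a simplification of the idea.

```python
def absolute_value(x, angelic=None):
    """Toy program with a buggy condition (it should be `x < 0`)."""
    condition = x < -1
    if angelic is not None:
        condition = angelic  # override the condition with a forced value
    return -x if condition else x


def failing_test(run):
    # Fails on the unmodified program, which returns -1 for input -1.
    assert run(-1) == 1


def angelic_values(test, program):
    """Return the forced condition values under which the test passes."""
    passing = []
    for forced in (True, False):
        try:
            test(lambda x: program(x, angelic=forced))
            passing.append(forced)
        except AssertionError:
            pass
    return passing


print(angelic_values(failing_test, absolute_value))  # [True]
```

The result says the condition must evaluate to true for the failing input, which is the kind of expected value the later phases would use as a specification for the synthesized condition.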
SE | Oct 19, 2018 | Code
Coming: a Tool for Mining Change Pattern Instances from Git Commits
Matias Martinez, Martin Monperrus
Software repositories such as Git have become a relevant source of information for software engineering researchers. For instance, the detection of commits that fulfill a given criterion (e.g., bug-fixing commits) is one of the most frequent tasks performed to understand software evolution. However, to our knowledge, there are no open-source tools that, given a Git repository, return all the instances of a given change pattern. In this paper, we present Coming, a tool that takes as input a Git repository and mines instances of change patterns in each commit. To do so, Coming computes fine-grained changes between two consecutive revisions, analyzes those changes to detect whether they correspond to an instance of a change pattern (specified by the user in XML), and finally, after analyzing all the commits, presents a) the frequency of code changes and b) the instances found in each commit. We evaluate Coming on a set of 28 pairs of revisions from Defects4J, finding instances of change patterns that involve If conditions in 26 of them.
SE | Oct 13, 2018 | Code
Human-competitive Patches in Automatic Program Repair with Repairnator
Martin Monperrus, Simon Urli, Thomas Durieux et al.
Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce 5 patches that were accepted by the human developers and permanently merged into the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair.
SE | Jul 31, 2018 | Code
An Empirical Study on Quality of Android Applications written in Kotlin language
Bruno Gois Mateus, Matias Martinez
Context: In recent years, developers of mobile applications have gained access to new paradigms and tools for developing mobile applications. For instance, since 2017 Android developers have had official support for writing Android applications in the Kotlin language. Kotlin is a programming language, fully interoperable with Java, that combines object-oriented and functional features. Objective: The goal of this paper is twofold. First, it aims to study the degree of adoption of Kotlin in the development of open-source Android applications and to measure the amount of Kotlin code inside Android applications. Second, it aims to measure the quality of Android applications written in Kotlin and to compare it with the quality of Android applications written in Java. Method: We first defined a method to detect Kotlin applications in a dataset of open-source Android applications. Then, we analyzed those applications to detect instances of code smells and computed an estimate of the quality of the applications. Finally, we studied how the introduction of Kotlin code impacts the quality of an Android application. Results: Our experiment found that 11.26% of the applications in a dataset of 2,167 open-source applications have been written (partially or fully) in Kotlin. We found that the introduction of Kotlin code increases the quality (in terms of the presence of code smells) of the majority of the Android applications initially written in Java.
SE | Jul 15, 2017 | Code
Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities
Martin White, Michele Tufano, Matias Martinez et al.
In the field of automated program repair, the redundancy assumption claims large programs contain the seeds of their own repair. However, most redundancy-based program repair techniques do not reason about the repair ingredients---the code that is reused to craft a patch. We aim to reason about the repair ingredients by using code similarities to prioritize and transform statements in a codebase for patch generation. Our approach, DeepRepair, relies on deep learning to reason about code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity to suspicious elements (i.e., code elements that contain suspicious statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined these new search strategies for patch generation with respect to effectiveness from the viewpoint of a software maintainer. Our comparative experiments were executed on six open-source Java projects including 374 buggy program revisions and consisted of 19,949 trials spanning 2,616 days of computation time. DeepRepair's search strategy using code similarities generally found compilable ingredients faster than the baseline, jGenProg, but this improvement neither yielded test-adequate patches in fewer attempts (on average) nor found significantly more patches than the baseline. Although the patch counts were not statistically different, there were notable differences between the nature of DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot be found by existing redundancy-based repair techniques.
SE | Oct 24, 2014 | Code
ASTOR: Evolutionary Automatic Software Repair for Java
Matias Martinez, Martin Monperrus
Context: In recent years, many automatic software repair approaches have been presented by the software engineering research community. According to the corresponding papers, these approaches are able to repair real defects from open-source projects. Problem: Some previous publications in the automatic repair field do not provide the implementation of their approaches. Consequently, it is not possible for the research community to re-execute the original evaluation, to set up new evaluations (for example, to evaluate the performance against new defects), or to compare approaches against each other. Solution: We propose a publicly available automatic software repair tool called Astor. It implements three state-of-the-art automatic software repair approaches in the context of Java programs (including GenProg and a subset of PAR's templates). The source code of Astor is licensed under the GNU General Public Licence (GPL v2).
SE | Sep 15, 2013 | Code
Automatically Extracting Instances of Code Change Patterns with AST Analysis
Matias Martinez, Laurence Duchien, Martin Monperrus
A code change pattern represents a kind of recurrent modification in software. For instance, a known code change pattern consists of the change of the conditional expression of an if statement. Previous work has identified different change patterns. Complementary to the identification and definition of change patterns, the automatic extraction of pattern instances is essential to measure their empirical importance. For example, it enables one to count and compare the number of conditional expression changes in the history of different projects. In this paper, we present a novel approach for searching for pattern instances in software history. Our technique is based on the analysis of the Abstract Syntax Trees (ASTs) of the files within a given commit. We validate our approach by counting instances of 18 change patterns in 6 open-source Java projects.
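As a rough illustration of the "changed if condition" pattern, the sketch below compares two versions of a Python file using the standard `ast` module. The paper analyzes Java code with AST differencing; this toy instead pairs `if` statements naively by source position, and the function names are hypothetical.

```python
import ast


def if_conditions(source):
    """Return (line, structural dump of the condition) for every `if`, in source order."""
    tree = ast.parse(source)
    ifs = [node for node in ast.walk(tree) if isinstance(node, ast.If)]
    ifs.sort(key=lambda node: (node.lineno, node.col_offset))
    return [(node.lineno, ast.dump(node.test)) for node in ifs]


def changed_if_conditions(before, after):
    """Return lines (in `before`) of `if` statements whose condition changed.

    Naive pairing by position: returns None when the number of `if`
    statements differs, because this sketch cannot match them up.
    """
    old, new = if_conditions(before), if_conditions(after)
    if len(old) != len(new):
        return None
    return [line for (line, old_test), (_, new_test) in zip(old, new)
            if old_test != new_test]


before = "def f(x):\n    if x > 0:\n        return 1\n    return 0\n"
after = "def f(x):\n    if x >= 0:\n        return 1\n    return 0\n"
print(changed_if_conditions(before, after))  # [2]
```

Counting such instances over every commit of a project would give the kind of frequency data described above, although a real tool needs proper tree matching to handle inserted, moved, or deleted statements.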
LG | Feb 12, 2024
Identifying architectural design decisions for achieving green ML serving
Francisco Durán, Silverio Martínez-Fernández, Matias Martinez et al.
The growing use of large machine learning models highlights concerns about their increasing computational demands. While the energy consumption of their training phase has received attention, fewer works have considered the inference phase. For ML inference, the binding of ML models to the ML system for user access, known as ML serving, is a critical yet understudied step for achieving efficiency in ML applications. We examine the literature in ML architectural design decisions and Green AI, with a special focus on ML serving. The aim is to analyze ML serving architectural design decisions for the purpose of understanding and identifying them with respect to quality characteristics from the point of view of researchers and practitioners in the context of ML serving literature. Our results (i) identify ML serving architectural design decisions along with their corresponding components and associated technological stack, and (ii) provide an overview of the quality characteristics studied in the literature, including energy efficiency. This preliminary study is the first step in our goal to achieve green ML serving. Our analysis may aid ML researchers and practitioners in making green-aware architecture design decisions when serving their models.
SE | Dec 19, 2024
Insights into resource utilization of code small language models serving with runtime engines and execution providers
Francisco Durán, Matias Martinez, Patricia Lago et al.
The rapid growth of language models, particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing the resource utilization of language model inference is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Our goal is to analyze the impact of deep learning serving configurations, defined as combinations of runtime engines and execution providers, on resource utilization, in terms of energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code generation SLMs. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Serving configuration choice significantly impacts resource utilization. While further research is needed, we recommend choosing, among the above configurations, those best suited to software engineers' requirements in order to improve the resource-utilization efficiency of serving.
SE | Nov 24, 2021
FLACOCO: Fault Localization for Java based on Industry-grade Coverage
André Silva, Matias Martinez, Benjamin Danglot et al.
Fault localization is an essential step in the debugging process. Spectrum-Based Fault Localization (SBFL) is a popular fault localization family of techniques, utilizing code-coverage to predict suspicious lines of code. In this paper, we present FLACOCO, a new fault localization tool for Java. The key novelty of FLACOCO is that it is built on top of one of the most used and most reliable coverage libraries for Java, JaCoCo. FLACOCO is made available through a well-designed command-line interface and Java API and supports all Java versions. We validate FLACOCO on two use-cases from the automatic program repair domain by reproducing previous scientific experiments. We find it is capable of effectively replacing the state-of-the-art FL library. Overall, we hope that FLACOCO will help research in fault localization as well as industry adoption thanks to being founded on industry-grade code coverage. An introductory video is available at https://youtu.be/RFRyvQuwRYA
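To show how spectrum-based fault localization turns coverage into suspiciousness scores, here is a minimal Python sketch using the Ochiai formula, one commonly used SBFL metric. The abstract does not state which formula FLACOCO applies, so the choice of Ochiai, the function name, and the data layout are assumptions, and the coverage data is made up.

```python
import math


def ochiai(coverage, outcomes):
    """Compute Ochiai suspiciousness for each covered line.

    coverage: dict mapping test name -> set of covered line numbers.
    outcomes: dict mapping test name -> True if the test passed.
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    lines = set().union(*coverage.values())
    scores = {}
    for line in lines:
        # ef / ep: failing / passing tests that execute this line.
        ef = sum(1 for t, cov in coverage.items() if line in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if line in cov and outcomes[t])
        denominator = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denominator if denominator else 0.0
    return scores


coverage = {"t1": {1, 2}, "t2": {1, 3}, "t3": {1, 2, 3}}
outcomes = {"t1": True, "t2": False, "t3": True}
scores = ochiai(coverage, outcomes)
for line in sorted(scores, key=scores.get, reverse=True):
    print(line, round(scores[line], 3))
```

Line 3 ranks first here because it is executed by the only failing test and by just one passing test, while line 2 is never executed by a failing test and scores zero.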
SE | Aug 10, 2021
Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size
Martin Monperrus, Matias Martinez, He Ye et al.
This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.
SE | May 10, 2021
Neural Program Repair with Execution-based Backpropagation
He Ye, Matias Martinez, Martin Monperrus
Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens without incorporating program-specific information during neural network weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. RewardRepair correctly repairs 207 bugs over four benchmarks. We report on repair success for 121 bugs that are fixed for the first time in the literature. Also, RewardRepair produces up to 45.3% of compilable patches, an improvement over the 39% by the state-of-the-art.
SE | Dec 12, 2020
A Software-Repair Robot based on Continual Learning
Benoit Baudry, Zimin Chen, Khashayar Etemadi et al.
Software bugs are common and correcting them accounts for a significant part of costs in the software development and maintenance process. This calls for automatic techniques to deal with them. One promising direction towards this goal is gaining repair knowledge from historical bug fixing examples. Retrieving insights from software development history is particularly appealing with the constant progress of machine learning paradigms and skyrocketing 'big' bug fixing data generated through Continuous Integration (CI). In this paper, we present R-Hero, a novel software repair bot that applies continual learning to acquire bug fixing strategies from continuous streams of source code changes, implemented for the single development platform GitHub/Travis CI. We describe R-Hero, our novel system for learning how to fix bugs based on continual training, and we uncover initial successes as well as novel research challenges for the community.
SE | Dec 11, 2020
A Comprehensive Study of Code-removal Patches in Automated Program Repair
Davide Ginelli, Matias Martinez, Leonardo Mariani et al.
Automatic Program Repair (APR) techniques show promise in reducing the cost of debugging. Many relevant APR techniques follow the generate-and-validate approach, that is, the faulty program is iteratively modified with different change operators and then validated with a test suite until a plausible patch is generated. In particular, Kali is a generate-and-validate technique developed to investigate the possibility of generating plausible patches by only removing code. Former studies show that Kali indeed successfully addressed several faults. This paper addresses the case of code-removal patches in automated program repair, investigating the reasons and the scenarios that make their creation possible, and their relationship with patches implemented by developers. Our study reveals that code-removal patches are often insufficient to fix bugs, and proposes a comprehensive taxonomy of code-removal patches that provides evidence of the problems that may affect test suites, opening new opportunities for researchers in the field of automatic program repair.
SE | Nov 20, 2020
Hyperparameter Optimization for AST Differencing
Matias Martinez, Jean-Rémy Falleri, Martin Monperrus
Computing the differences between two versions of the same program is an essential task for software development and software evolution research. AST differencing is the most advanced way of doing so, and an active research area. Yet, AST differencing algorithms rely on configuration parameters that may have a strong impact on their effectiveness. In this paper, we present a novel approach named DAT (Diff Auto Tuning) for hyperparameter optimization of AST differencing. We thoroughly state the problem of hyper-configuration for AST differencing. We evaluate our data-driven approach DAT to optimize the edit-scripts generated by the state-of-the-art AST differencing algorithm named GumTree in different scenarios. DAT is able to find a new configuration for GumTree that improves the edit-scripts in 21.8% of the evaluated cases.
SE | Jul 14, 2020
Estimating the Potential of Program Repair Search Spaces with Commit Analysis
Khashayar Etemadi, Niloofar Tarighat, Siddharth Yadav et al.
The most natural method for evaluating program repair systems is to run them on bug datasets, such as Defects4J. Yet, using this evaluation technique on arbitrary real-world programs requires heavy configuration. In this paper, we propose a purely static method to evaluate the potential of the search space of repair approaches. This new method enables researchers and practitioners to encode the search spaces of repair approaches and select potentially useful ones without struggling with tool configuration and execution. We encode the search spaces by specifying the repair strategies they employ. Next, we use the specifications to check whether past commits lie in repair search spaces. For a repair approach, including many human-written past commits in its search space indicates its potential to generate useful patches. We implement our evaluation method in LighteR. LighteR takes a Git repository and outputs a list of commits whose source code changes lie in repair search spaces. We run LighteR on 55,309 commits from the history of 72 GitHub repositories and show that LighteR's precision and recall are 77% and 92%, respectively. Overall, our experiments show that our novel method is both lightweight and effective for studying the search space of program repair approaches.
SE | Feb 10, 2020
E-APR: Mapping the Effectiveness of Automated Program Repair
Aldeida Aleti, Matias Martinez
Automated Program Repair (APR) is a fast growing area with numerous new techniques being developed to tackle one of the most challenging software engineering problems. APR techniques have shown promising results, giving us hope that one day it will be possible for software to repair itself. In this paper, we focus on the problem of objective performance evaluation of APR techniques. We introduce a new approach, Explaining Automated Program Repair (E-APR), which identifies features of buggy programs that explain why a particular instance is difficult for an APR technique. E-APR is used to examine the diversity and quality of the buggy programs used by most researchers, and analyse the strengths and weaknesses of existing APR techniques. E-APR visualises an instance space of buggy programs, with each buggy program represented as a point in the space. The instance space is constructed to reveal areas of hard and easy buggy programs, and enables the strengths and weaknesses of APR techniques to be identified.
SE · Oct 26, 2019
Automated Classification of Overfitting Patches with Statically Extracted Code Features
He Ye, Jian Gu, Matias Martinez et al.
Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and a buggy program in order to extract code features at the abstract syntax tree (AST) level. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned ODS model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9% of program repair patches from 26 projects, which improves the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different APR systems.
SE · Sep 30, 2019
Automated Patch Assessment for Program Repair at Scale
He Ye, Matias Martinez, Martin Monperrus
In this paper, we perform automatic correctness assessment of patches generated by program repair systems. We consider the human-written patch as the ground-truth oracle and randomly generate tests based on it, a technique proposed by Shamshiri et al. and called Random testing with Ground Truth (RGT) in this paper. We build a curated dataset of 638 patches for Defects4J generated by 14 state-of-the-art repair systems, and we evaluate automated patch assessment on this dataset. The results of this study are novel and significant: First, we improve the state-of-the-art performance of automatic patch assessment with RGT by 190% by improving the oracle; Second, we show that RGT is reliable enough to help scientists perform overfitting analysis when they evaluate program repair systems; Third, we improve the external validity of program repair knowledge with the largest study ever.
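The idea of using the human-written patch as an oracle for randomly generated tests can be illustrated with a toy sketch. This is not the paper's RGT implementation: the function `find_behavioral_difference`, the three example patches, and the integer input range are all hypothetical, chosen only to show the principle that a candidate patch whose behavior diverges from the ground truth on some random input is flagged as overfitting.

```python
import random
from typing import Callable, Optional


def find_behavioral_difference(
    ground_truth: Callable[[int], int],
    candidate: Callable[[int], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[int]:
    """Return a random input on which the candidate patch disagrees with the
    ground-truth (human-written) patch, or None if no disagreement is found."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    for _ in range(trials):
        x = rng.randint(-100, 100)  # hypothetical input domain
        if ground_truth(x) != candidate(x):
            return x
    return None


def human_patch(x: int) -> int:
    """Hypothetical ground truth: the intended behavior is absolute value."""
    return abs(x)


def overfitting_patch(x: int) -> int:
    """Hypothetical patch that passes a test suite checking only x >= 0."""
    return x


def correct_patch(x: int) -> int:
    """Hypothetical patch that is semantically equivalent to the ground truth."""
    return x if x >= 0 else -x


witness = find_behavioral_difference(human_patch, overfitting_patch)
print("overfitting patch differs on input:", witness)
print("correct patch differs on input:",
      find_behavioral_difference(human_patch, correct_patch))
```

Finding no difference does not prove a patch correct; it only means the sampled inputs did not expose a divergence, which is why the quality of the generated tests matters.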
SE · Jul 22, 2019
Learning the Relation between Code Features and Code Transforms with Structured Prediction
Zhongxing Yu, Matias Martinez, Zimin Chen et al.
To effectively guide the exploration of the code transform space for automated code evolution techniques, we present in this paper the first approach for structurally predicting code transforms at the level of AST nodes using conditional random fields (CRFs). Our approach first learns offline a probabilistic model that captures how certain code transforms are applied to certain AST nodes, and then uses the learned model to predict transforms for arbitrary new, unseen code snippets. Our approach involves a novel representation of both programs and code transforms. Specifically, we introduce the formal framework for defining the so-called AST-level code transforms and we demonstrate how the CRF model can be accordingly designed, learned, and used for prediction. We instantiate our approach in the context of repair transform prediction for Java programs. Our instantiation contains a set of carefully designed code features, deals with the training data imbalance issue, and comprises transform constraints that are specific to code. We conduct a large-scale experimental evaluation based on a dataset of bug fixing commits from real-world Java projects. The results show that when the popular evaluation metric top-3 is used, our approach predicts the code transforms with an accuracy varying from 41% to 53% depending on the transforms. Our model outperforms two baselines based on history probability and neural machine translation (NMT), suggesting the importance of considering code structure in achieving good prediction accuracy. In addition, a proof-of-concept synthesizer is implemented to concretize some repair transforms to get the final patches. The evaluation of the synthesizer on the Defects4J benchmark confirms the usefulness of the predicted AST-level repair transforms in producing high-quality patches.
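For readers unfamiliar with the top-3 metric mentioned above, a minimal sketch of top-k accuracy follows. It assumes each prediction is a list of transform names ranked from most to least likely; the transform names and the helper `top_k_accuracy` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   truths: Sequence[str],
                   k: int = 3) -> float:
    """Fraction of cases whose true label appears among the first k predictions."""
    if len(ranked_predictions) != len(truths) or not truths:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, truths)
               if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical ranked transform predictions for two AST nodes.
predictions = [
    ["wrap_with_if", "add_null_check", "change_operator", "replace_call"],
    ["replace_call", "wrap_with_if", "add_null_check", "change_operator"],
]
truths = ["change_operator", "change_operator"]

# The first case is a hit (rank 3); the second is a miss (rank 4).
print(top_k_accuracy(predictions, truths, k=3))  # 0.5
```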
SE · Jul 21, 2019
On the adoption, usage and evolution of Kotlin Features on Android development
Bruno Góis Mateus, Matias Martinez
Background: Google announced Kotlin as an official Android programming language in 2017, giving developers an option of writing applications using a language that combines object-oriented and functional features. Aims: The goal of this work is to understand the usage of Kotlin features considering four aspects: i) which features are adopted, ii) what is the degree of adoption, iii) when are these features added into Android applications for the first time, and iv) how the usage of features evolves along with applications' evolution. Method: Exploring the source code of 387 Android applications, we identify the usage of Kotlin features in each version of each application and compute the moment at which each feature is used for the first time. Finally, we identify the evolution trend that best describes the usage of these features. Results: 15 out of 26 features are used in at least 50% of applications. Moreover, we found that type inference, lambda and safe call are the most used features. Also, we observed that the most used Kotlin features are those first included in Android applications. Finally, we report that the majority of applications tend to add more instances of 24 out of 26 features along with their evolution. Conclusions: Our study generates 7 main findings. We present their implications, which are addressed to developers, researchers and tool builders in order to foster the use of Kotlin features to develop Android applications.
SE · May 28, 2019
Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts
Thomas Durieux, Fernanda Madeiral, Matias Martinez et al.
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their techniques and tools.
SE · Nov 4, 2018
Automatic Repair of Real Bugs in Java: A Large-Scale Experiment on the Defects4J Dataset
Matias Martinez, Thomas Durieux, Romain Sommerard et al.
Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J comes with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic test-suite based repair on Defects4J. The result of our experiment shows that the considered state-of-the-art repair methods can generate patches for 47 out of 224 bugs. However, those patches are only test-suite adequate, which means that they pass the test suite and may potentially be incorrect beyond the test-suite satisfaction correctness criterion. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly repaired with test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial or incorrect patches still pass the test suite. With respect to practical applicability, it takes on average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All the repair systems and experimental results are publicly available on Github in order to facilitate future research on automatic repair.
SE · Oct 24, 2018
Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System
Zhongxing Yu, Matias Martinez, Benjamin Danglot et al.
Among the many different kinds of program repair techniques, one widely studied family of techniques is called test suite based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based repair techniques can simply overfit to the test suite used and fail to generalize to other tests. We deeply analyze the overfitting problem in program repair and give a classification of this problem. This classification will help the community to better understand and design techniques to defeat the overfitting problem. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically with respect to alleviating two different kinds of overfitting issues; 2) empirically based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue (regression introduction) but, due to the oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue (incomplete fixing).
SE · May 9, 2018
A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark
He Ye, Matias Martinez, Thomas Durieux et al.
Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an empirical study of automatic repair on a benchmark of bugs called QuixBugs, which has been little studied. In this paper, 1) We report on the characteristics of QuixBugs; 2) We study the effectiveness of 10 program repair tools on it; 3) We apply three patch correctness assessment techniques to comprehensively study the presence of overfitting patches in QuixBugs. Our key results are: 1) 16/40 buggy programs in QuixBugs can be repaired with at least a test suite adequate patch; 2) A total of 338 plausible patches are generated on the QuixBugs by the considered tools, and 53.3% of them are overfitting patches according to our manual assessment; 3) The three automated patch correctness assessment techniques, RGT_Evosuite, RGT_InputSampling and GT_Invariants, achieve an accuracy of 98.2%, 80.8% and 58.3% in overfitting detection, respectively. To our knowledge, this is the largest empirical study of automatic repair on QuixBugs, combining both quantitative and qualitative insights. All our empirical results are publicly available on GitHub in order to facilitate future research on automatic program repair.
SE · Feb 9, 2018
Astor: Exploring the Design Space of Generate-and-Validate Program Repair beyond GenProg
Matias Martinez, Martin Monperrus
In recent years, researchers have proposed novel repair approaches that automatically generate patches for repairing software bugs. Repair approaches can be loosely characterized by their main design philosophy, such as generate-and-validate or synthesis-based. Each of those repair approaches is a point in the design space of program repair. Our goal is to facilitate the design, development and evaluation of repair approaches by providing a framework that: a) contains components commonly present in approach implementations, so that new approaches can be built on them, and b) provides built-in implementations of existing repair approaches. This paper presents a framework named Astor that encodes the design space of generate-and-validate repair approaches. Astor provides extension points that form the explicit decision space of program repair. Over those extension points, researchers can reuse existing components or implement new ones. Astor includes 6 Java implementations of repair approaches, including one of the pioneers: GenProg. Researchers have already been defining new approaches over Astor, proposing improvements to the built-in approaches by using the extension points, and executing approach implementations from Astor in their evaluations. The implementations of the repair approaches built over Astor are capable of repairing, in total, 98 real bugs from 5 large Java programs.
SE · Jan 22, 2018
Do Mobile Developers Ask on Q&A Sites About Error Codes Thrown by a Cross-Platform App Development Framework? An Empirical Study
Matias Martinez, Sylvain Lecomte
In recent years, development frameworks have emerged to ease the development and maintenance of cross-platform mobile applications. The Xamarin framework is one of them: it takes as input an app written in C# and produces native code for the Android, iOS and Windows Mobile platforms. When using Xamarin, developers can encounter errors, identified by codes, thrown by the framework. Unfortunately, the official Xamarin documentation does not provide a complete description, solution or workaround for all those codes. In this paper, we analyze two question-and-answer sites related to Xamarin to find questions that mention those error codes. We found on both sites that there are questions written by developers asking about Xamarin errors, and the majority of them have at least one answer. Our intuition is that this discovered information could be useful for supporting Xamarin developers.
SE · Dec 27, 2017
Two Datasets of Questions and Answers for Studying the Development of Cross-platform Mobile Applications using Xamarin Framework
Matias Martinez
A cross-platform mobile application is an application that runs on multiple mobile platforms (Android, iOS). Several frameworks have been proposed to simplify the development of cross-platform mobile applications and, therefore, to reduce development and maintenance costs. Among them, the cross-compiler mobile development frameworks, such as Xamarin from Microsoft, transform the application's code written in an intermediate (i.e., non-native) language to native code for each desired platform. However, to the best of our knowledge, there is not much research about the advantages and disadvantages of using those frameworks during the development and maintenance phases of mobile applications. The objective of this paper is twofold. Firstly, to present two datasets of questions and answers (Q&A) related to the development of mobile applications using Xamarin. Secondly, to show their usefulness, we present a replication study for discovering the main discussion topics of Xamarin development. We created the two datasets by mining two Q&A sites: Xamarin Forum and Stack Overflow. Then, for discovering the main topics of the questions from both datasets, we replicated a study that applies Latent Dirichlet Allocation (LDA). Finally, we compared the discovered topics with those topics about general mobile development reported by a previous study. Our datasets have 85,908 questions mined from the Xamarin Forum and 44,434 from Stack Overflow. Among the main topics discovered from those questions, we found that some of them are exclusively related to Xamarin and Microsoft technologies, such as the design pattern "MVVM". Both datasets with Xamarin-related Q&A can be used by the research community for understanding the main concerns about developing cross-platform mobile applications using Xamarin. In this paper, we used them to replicate a study about topic discovery.
SE · Dec 11, 2017
Ultra-Large Repair Search Space with Automatically Mined Templates: the Cardumen Mode of Astor
Matias Martinez, Martin Monperrus
Astor is a program repair library which has different modes. In this paper, we present the Cardumen mode of Astor, a repair approach based on mined templates that has an ultra-large search space. We evaluate the capacity of Cardumen to discover test-suite adequate patches (aka plausible patches) over the 356 real bugs from Defects4J. Cardumen finds 8935 patches over 77 bugs of Defects4J. This is the largest number of automatically synthesized patches ever reported, all patches being available in an open-science repository. Moreover, Cardumen identifies 8 unique patches, that is, patches for Defects4J bugs that were never repaired in the whole history of program repair.
SE · Mar 10, 2017
XamForumDB: a dataset for studying Q&A about cross-platform mobile applications development
Matias Martinez, Sylvain Lecomte
Android and iOS are the two mobile platforms present in almost all smartphones built in recent years. Developing an application that targets both platforms is a challenge. A traditional way is to build two different apps, one in Java for Android, the other in Objective-C for iOS. Xamarin is a framework for developing Android and iOS apps which allows developers to share most of the application code across multiple implementations of the app, each for a specific platform. In this paper, we present XamForumDB, a database that stores discussions, questions and answers extracted from the Xamarin forum. We envision that the research community could use it for studying, for instance, the problems of developing this kind of application.
SE · Mar 1, 2017
Test Case Generation for Program Repair: A Study of Feasibility and Effectiveness
Zhongxing Yu, Matias Martinez, Benjamin Danglot et al.
Among the many different kinds of program repair techniques, one widely studied family of techniques is called test suite based repair. Test-suites are in essence input-output specifications and are therefore typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based program repair techniques pass the test suite, yet may be incorrect. Patches that are overly specific to the used test suite and fail to generalize to other test cases are called overfitting patches. In this paper, we investigate the feasibility and effectiveness of test case generation in alleviating the overfitting issue. We propose two approaches for using test case generation to improve test suite based repair, and perform an extensive evaluation of the effectiveness of the proposed approaches in enabling better test suite based repair on 224 bugs of the Defects4J repository. The results indicate that test case generation can change the resulting patch, but is not effective at turning incorrect patches into correct ones. We identify the problems related with the ineffectiveness, and anticipate that our results and findings will lead to future research to build test-case generation techniques that are tailored to automatic repair systems.
SE · Jan 24, 2017
Towards the quality improvement of cross-platform mobile applications
Matias Martinez, Sylvain Lecomte
During the last ten years, the number of smartphones and mobile applications has been constantly growing. Android, iOS and Windows Mobile are three mobile platforms that cover almost all smartphones in the world in 2017. Developing a mobile app involves first choosing the platforms the app will run on, and then developing specific solutions (i.e., native apps) for each chosen platform using platform-related toolkits such as AndroidSDK. A cross-platform mobile application is an app that runs on two or more mobile platforms. Several frameworks have been proposed to simplify the development of cross-platform mobile applications and to reduce development and maintenance costs. They are called cross-platform mobile app development frameworks. However, to our knowledge, the life-cycle and the quality of cross-platform mobile applications built using those frameworks have not been studied in depth. Our main goal is to first study the processes of development and maintenance of mobile applications built using cross-platform mobile app development frameworks, focusing particularly on the bug-fixing activity. Then, we aim at defining tools for automatically repairing bugs in cross-platform mobile applications.
SE · Jun 5, 2015
Dynamic Analysis can be Improved with Automatic Test Suite Refactoring
Jifeng Xuan, Benoit Cornu, Matias Martinez et al.
Context: Developers design test suites to automatically verify that software meets its expected behaviors. Many dynamic analysis techniques are performed on the exploitation of execution traces from test cases. However, in practice, there is only one trace that results from the execution of one manually-written test case. Objective: In this paper, we propose a new technique of test suite refactoring, called B-Refactoring. The idea behind B-Refactoring is to split a test case into small test fragments, which cover a simpler part of the control flow to provide better support for dynamic analysis. Method: For a given dynamic analysis technique, our test suite refactoring approach monitors the execution of test cases and identifies small test cases without loss of the test ability. We apply B-Refactoring to assist two existing analysis tasks: automatic repair of if-statements bugs and automatic analysis of exception contracts. Results: Experimental results show that test suite refactoring can effectively simplify the execution traces of the test suite. Three real-world bugs that could previously not be fixed with the original test suite are fixed after applying B-Refactoring; meanwhile, exception contracts are better verified via applying B-Refactoring to original test suites. Conclusions: We conclude that applying B-Refactoring can effectively improve the purity of test cases. Existing dynamic analysis tasks can be enhanced by test suite refactoring.
SE · May 26, 2015
Automatic Repair of Real Bugs: An Experience Report on the Defects4J Dataset
Matias Martinez, Thomas Durieux, Jifeng Xuan et al.
Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J is provided with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic repair on Defects4J. The result of our experiment shows that 47 bugs of the Defects4J dataset can be automatically repaired by state-of-the-art repair. This sets a baseline for future research on automatic repair for Java. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly fixed with test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial and incorrect patches still pass the test suite. With respect to practical applicability, it takes on average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All the systems and experimental results are publicly available on GitHub in order to facilitate future research on automatic repair.
SE · Mar 25, 2014
Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches
Matias Martinez, Westley Weimer, Martin Monperrus
Much initial research on automatic program repair has focused on experimental results to probe their potential to find patches and reduce development effort. Relatively less effort has been put into understanding the hows and whys of such approaches. For example, a critical assumption of the GenProg technique is that certain bugs can be fixed by copying and re-arranging existing code. In other words, GenProg assumes that the fix ingredients already exist elsewhere in the code. In this paper, we formalize these assumptions around the concept of "temporal redundancy". A temporally redundant commit is only composed of what has already existed in previous commits. Our experiments show that a large proportion of commits that add existing code are temporally redundant. This validates the fundamental redundancy assumption of GenProg.
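The definition of a temporally redundant commit can be made concrete with a small sketch. It assumes line-level granularity with whitespace normalization, which is an illustrative choice and not necessarily the granularity used in the paper; the function `temporally_redundant_flags` and the sample history are hypothetical.

```python
from typing import Iterable, List, Set


def temporally_redundant_flags(commits: Iterable[List[str]]) -> List[bool]:
    """For each commit, given as its list of added lines in chronological
    order, report whether every added line already appeared in an earlier
    commit."""
    seen: Set[str] = set()
    flags: List[bool] = []
    for added_lines in commits:
        normalized = [line.strip() for line in added_lines if line.strip()]
        # A commit that adds nothing is not counted as redundant here.
        flags.append(bool(normalized) and all(l in seen for l in normalized))
        seen.update(normalized)
    return flags


# Hypothetical history: each inner list holds the lines added by one commit.
history = [
    ["int x = 0;", "return x;"],   # introduces new code
    ["  return x;"],               # only re-uses an existing line
    ["int y = 1;", "return x;"],   # mixes new and existing code
]
print(temporally_redundant_flags(history))  # [False, True, False]
```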
SE · Nov 14, 2013
Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing
Matias Martinez, Martin Monperrus
This paper is about understanding the nature of bug fixing by analyzing thousands of bug fix transactions of software repositories. It then places this learned knowledge in the context of automated program repair. We give extensive empirical results on the nature of human bug fixes at a large scale and a fine granularity with abstract syntax tree differencing. We set up mathematical reasoning on the search space of automated repair and the time to navigate through it. By applying our method on 14 repositories of Java software and 89,993 versioning transactions, we show that not all probabilistic repair models are equivalent.