SE · Apr 24, 2023 · Code
ITER: Iterative Neural Repair for Multi-Location Patches
He Ye, Martin Monperrus
Automated program repair (APR) has achieved promising results, especially using neural networks. Yet, the overwhelming majority of patches produced by APR tools are confined to one single location. When looking at the patches produced with neural repair, most of them fail to compile, while a few uncompilable ones go in the right direction. In both cases, the fundamental problem is to ignore the potential of partial patches. In this paper, we propose an iterative program repair paradigm called ITER founded on the concept of improving partial patches until they become plausible and correct. First, ITER iteratively improves partial single-location patches by fixing compilation errors and further refining the previously generated code. Second, ITER iteratively improves partial patches to construct multi-location patches, with fault localization re-execution. ITER is implemented for Java based on battle-proven deep neural networks and code representation. ITER is evaluated on 476 bugs from 10 open-source projects in Defects4J 2.0. ITER succeeds in repairing 15.5% of them, including 9 uniquely repaired multi-location bugs.
SE · Sep 26, 2023
Supersonic: Learning to Generate Source Code Optimizations in C/C++
Zimin Chen, Sen Fang, Martin Monperrus
Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
SE · Sep 27, 2024
RepairBench: Leaderboard of Frontier Models for Program Repair
André Silva, Martin Monperrus
AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advancements in AI surely impact state-of-the-art performance of program repair. Yet, grasping this progress requires frequent and standardized evaluations. We propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: 1) it is execution-based: all patches are compiled and executed against a test suite, 2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models against real-world program repair tasks. We publicly release the evaluation framework of RepairBench. We will update the leaderboard as new frontier models are released.
SE · Apr 30 · Code
The Grand Software Supply Chain of AI Systems
Carmine Cesarano, Martin Monperrus
AI systems rest on software with low integrity mechanisms, leaving AI systems exposed across every stage from data acquisition to final inference. This paper makes the AI supply chain a first-class object of analysis, decomposing it across four architectural layers: data acquisition, model training, model inference, and a cross-cutting substrate. Within these layers, we identify four structural gaps that traditional supply chain mechanisms do not address: verifiability, versioning, observability, and traceability. Current AI systems fall short on all of them: they carry undeclared behavioral couplings that no resolver enforces; they cannot be reverted to known working assemblies; they degrade silently rather than surfacing breaking changes; and their lineage can hardly be approximated. To illustrate the scale of the software supply chain of AI, we measure a reference stack of 48 production-grade open-source projects, which declares 4,664 direct dependencies, resolves to 11,508 transitive packages, and totals roughly 392M lines of code.
CR · Apr 27 · Code
Evaluating Cryptographic API Misuse Detectors for Go
Vivi Andersson, Martin Monperrus
Cryptographic API misuse represents a critical vulnerability class that undermines the security foundations of modern software. Yet, it remains largely unexplored in Go despite its dominance in security-critical infrastructure. This paper presents the first comprehensive study of cryptographic API misuse detection in Go, identifying and analyzing 4 state-of-the-art tools (CodeQL, Gopher, Gosec, and Snyk Code) and establishing a consolidated taxonomy of 14 relevant misuse classes. Through an experimental evaluation of 328 security-critical open-source Go projects, we discovered 7,473 cryptographic API misuses, providing insights into the prevalence and distribution of these vulnerabilities. Our systematic comparison reveals significant variations in misuse coverage, with immediate practical implications for security engineers and long-term implications for research in this domain.
SE · Apr 21
FIKA: Expanding Dependency Reachability with Executability Guarantees
Yogya Gamage, Meriem Ben Chaaben, Martin Monperrus et al.
Automated third-party library analysis tools help developers by addressing key dependency management challenges, such as automating version updates, detecting vulnerabilities, and detecting breaking updates. Dependency reachability analysis aims at improving the precision of dependency management, by reducing the space of dependency issues to the ones that actually matter. Most tools for dependency reachability analysis are static and fundamentally limited by the absence of execution. In this paper, we propose FIKA, a pipeline for providing guarantees of executability for third-party library call sites. FIKA generates code that is executed, and whose execution trace provides guarantees that a third-party library call site is actually reachable. We apply our approach to a dataset of eight Java projects to empirically evaluate the effectiveness of FIKA. On average, 54% of these call sites are covered by the existing test suites, and therefore, have evidence for their executability. FIKA further improves this coverage by 20% and is able to demonstrate executability for 2363 dependency methods. In six out of eight projects, FIKA provides strong guarantees that more than 75% of call sites are executable. We further demonstrate that FIKA is capable of improving the results provided by Semgrep, a state-of-the-art static vulnerability reachability analysis tool. We show that FIKA can help prioritize the vulnerability updates with stronger guarantees of executability in cases where Semgrep yields inconclusive reachability results.
SE · Oct 27, 2025
The Design Space of Lockfiles Across Package Managers
Yogya Gamage, Deepika Tiwari, Martin Monperrus et al.
Software developers reuse third-party packages that are hosted in package registries. At build time, a package manager resolves and fetches the direct and indirect dependencies of a project. Most package managers also generate a lockfile, which records the exact set of resolved dependency versions. Lockfiles are used to reduce build times; to verify the integrity of resolved packages; and to support build reproducibility across environments and time. Despite these beneficial features, developers often struggle with their maintenance, usage, and interpretation. In this study, we unveil the major challenges related to lockfiles, such that future researchers and engineers can address them. We perform the first comprehensive study of lockfiles across 7 popular package managers, npm, pnpm, Cargo, Poetry, Pipenv, Gradle, and Go. First, we highlight the wide variety of design decisions that package managers make, regarding the generation process as well as the content of lockfiles. Next, we conduct a qualitative analysis based on semi-structured interviews with 15 developers. We capture first-hand insights about the benefits that developers perceive in lockfiles, as well as the challenges they face to manage these files. Following these observations, we make 5 recommendations to further improve lockfiles, for a better developer experience.
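The integrity-verification role of a lockfile mentioned above can be illustrated with a minimal sketch. The lockfile layout, the package name `example-lib`, and the helper names are hypothetical and do not correspond to the format of any of the seven package managers studied; the sketch only shows the general idea of pinning a version together with a checksum and re-checking it at fetch time.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_against_lockfile(lock: dict, fetched: dict) -> list:
    """Compare fetched artifacts with the checksums pinned in the lockfile.

    Returns a list of human-readable problems; an empty list means every
    locked package was fetched and matched its recorded checksum.
    """
    problems = []
    for name, entry in lock.items():
        artifact = fetched.get(name)
        if artifact is None:
            problems.append(f"{name}: missing artifact")
        elif sha256_hex(artifact) != entry["sha256"]:
            problems.append(f"{name}: checksum mismatch")
    return problems


# Hypothetical package contents and lockfile entry (illustration only).
original = b"def answer():\n    return 42\n"
lock = {"example-lib": {"version": "1.3.0", "sha256": sha256_hex(original)}}

# Same bytes as at lock time: no problems reported.
print(verify_against_lockfile(lock, {"example-lib": original}))

# Tampered bytes: the mismatch is reported.
tampered = original + b"import os\n"
print(verify_against_lockfile(lock, {"example-lib": tampered}))
```

Real package managers differ in which hash algorithm they record and where, which is part of the design space the study describes.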
SE · Mar 18
Bootstrapping Coding Agents: The Specification Is the Program
Martin Monperrus
A coding agent can bootstrap itself. Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch. This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp. The result carries a practical implication: the specification, not the implementation, is the stable artifact of record. Improving an agent means improving its specification; the implementation is, in principle, regenerable at any time.
SE · Apr 14
Classport: Designing Runtime Dependency Introspection for Java
Serena Cofano, Daniel Williams, Aman Sharma et al.
Runtime introspection of dependencies, i.e., the ability to observe which dependencies are currently used during program execution, is fundamental for Software Supply Chain security. Yet, Java has no support for it. We solve this problem with Classport, a blueprint and system that embeds dependency information into Java class files, enabling the retrieval of dependency information at runtime. We evaluate Classport on six real-world projects, demonstrating the feasibility in identifying dependencies at runtime.
SE · Mar 25
Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management
Larissa Schmid, Diogo Gaspar, Raphina Liu et al.
Modern software systems heavily rely on third-party dependencies, making software supply chain security a critical concern. We introduce the concept of software supply chain smells as structural indicators that signal potential security risks. We design and evaluate Dirty-Waters, a novel tool for detecting such smells in the supply chains of software packages. Through interviews with practitioners, we show that our proposed smells align with real-world concerns and capture signals considered valuable. A quantitative study of popular packages in the Maven and NPM ecosystems reveals that while smells are prevalent in both, they differ significantly across ecosystems, with traceability and signing issues dominating in Maven and most smells being rare in NPM, due to strong registry-level guarantees. Software supply chain smells support developers and organizations in making informed decisions and improving their software supply chain security posture.
SE · Feb 17
Byam: Fixing Breaking Dependency Updates with Large Language Models
Frank Reyes, May Mahmoud, Federico Bono et al.
Application Programming Interfaces (APIs) facilitate the integration of third-party dependencies within the code of client applications. However, changes to an API, such as deprecation, modification of parameter names or types, or complete replacement with a new API, can break existing client code. These changes are called breaking dependency updates. It is often tedious for API users to identify the cause of these breaks and update their code accordingly. In this paper, we explore the use of Large Language Models (LLMs) to automate client code updates in response to breaking dependency updates. We evaluate our approach on the BUMP dataset, a benchmark for breaking dependency updates in Java projects. Our approach leverages LLMs with advanced prompts, including information from the build process and from the breaking dependency analysis. We assess effectiveness at three granularity levels: at the build level, the file level, and the individual compilation error level. We experiment with five LLMs: Google Gemini-2.0 Flash, OpenAI GPT4o-mini, OpenAI o3-mini, Alibaba Qwen2.5-32b-instruct, and DeepSeek V3. Our results show that LLMs can automatically repair breaking updates. Among the considered models, OpenAI's o3-mini is the best, able to completely fix 27% of the builds when using prompts that include contextual information such as the buggy line, API differences, error messages, and step-by-step reasoning instructions. Also, it fixes 78% of the individual compilation errors. Overall, our findings demonstrate the potential for LLMs to fix compilation errors due to breaking dependency updates, supporting developers in their efforts to stay up-to-date with changes in their dependencies.
LG · Feb 6
On Randomness in Agentic Evals
Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
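To make the pass@k and pass^k metrics concrete, here is a minimal sketch using the commonly used combinatorial estimators: given n independent runs of a task of which c succeeded, pass@k is the probability that at least one of k runs drawn without replacement succeeds, and pass^k is the probability that all k succeed. The abstract does not state which estimators the authors use, so these definitions and the per-task success counts below are assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k runs, drawn without replacement from n, succeeds)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k runs, drawn without replacement from n, succeed)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when fewer than k successes exist, giving 0.0.
    return comb(c, k) / comb(n, k)


# Hypothetical benchmark: number of successful runs out of n = 10 per task.
n = 10
successes_per_task = [10, 7, 3, 0]

for k in (1, 3, 5):
    at_k = sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
    hat_k = sum(pass_hat_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
    print(f"k={k}: pass@k={at_k:.3f}  pass^k={hat_k:.3f}")
```

At k=1 the two metrics coincide with the mean success rate; as k grows, pass@k rises and pass^k falls, so the gap between them shows how much of a score depends on run-to-run luck.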
SE · Feb 10, 2022 · Code
Spork: Structured Merge for Java with Formatting Preservation
Simon Larsén, Jean-Rémy Falleri, Benoit Baudry et al.
The highly parallel workflows of modern software development have made merging of source code a common activity for developers. The state of the practice is based on line-based merge, which is ubiquitously used with "git merge". Line-based merge is however a generalized technique for any text that cannot leverage the structured nature of source code, making merge conflicts a common occurrence. As a remedy, research has proposed structured merge tools, which typically operate on abstract syntax trees instead of raw text. Structured merging greatly reduces the prevalence of merge conflicts but suffers from important limitations, the main ones being a tendency to alter the formatting of the merged code and being prone to excessive running times. In this paper, we present SPORK, a novel structured merge tool for JAVA. SPORK is unique as it preserves formatting to a significantly greater degree than comparable state-of-the-art tools. SPORK is also overall faster than the state of the art, in particular significantly reducing worst-case running times in practice. We demonstrate these properties by replaying 1740 real-world file merges collected from 119 open-source projects, and further demonstrate several key differences between SPORK and the state of the art with in-depth case studies.
SE · Dec 15, 2021 · Code
Harvesting Production GraphQL Queries to Detect Schema Faults
Louise Zetterlund, Deepika Tiwari, Martin Monperrus et al.
GraphQL is a new paradigm to design web APIs. Despite its growing popularity, there are few techniques to verify the implementation of a GraphQL API. We present a new testing approach based on GraphQL queries that are logged while users interact with an application in production. Our core motivation is that production queries capture real usages of the application, and are known to trigger behavior that may not be tested by developers. For each logged query, a test is generated to assert the validity of the GraphQL response with respect to the schema. We implement our approach in a tool called AutoGraphQL, and evaluate it on two real-world case studies that are diverse in their domain and technology stack: an open-source e-commerce application implemented in Python called Saleor, and an industrial case study which is a PHP-based finance website called Frontapp. AutoGraphQL successfully generates test cases for the two applications. The generated tests cover 26.9% of the Saleor schema, including parts of the API not exercised by the original test suite, as well as 48.7% of the Frontapp schema, detecting 8 schema faults, thanks to production queries.
SE · Dec 2, 2020 · Code
Production Monitoring to Improve Test Suites
Deepika Tiwari, Long Zhang, Martin Monperrus et al.
In this paper, we propose to use production executions to improve the quality of testing for certain methods of interest for developers. These methods can be methods that are not covered by the existing test suite, or methods that are poorly tested. We devise an approach called PANKTI which monitors applications as they execute in production, and then automatically generates differential unit tests, as well as derived oracles, from the collected data. PANKTI's monitoring and generation focuses on one single programming language, Java. We evaluate it on three real-world, open-source projects: a videoconferencing system, a PDF manipulation library, and an e-commerce application. We show that PANKTI is able to generate differential unit tests by monitoring target methods in production, and that the generated tests improve the quality of the test suite of the application under consideration.
SE · Aug 17, 2020 · Code
CROW: Code Diversification for WebAssembly
Javier Cabrera Arteaga, Orestis Malivitsis, Oscar Vera Pérez et al.
The adoption of WebAssembly has rapidly increased in the last few years as it provides a fast and safe model for program execution. However, WebAssembly is not exempt from vulnerabilities that could be exploited by side-channel attacks. This class of vulnerabilities can be addressed by code diversification. In this paper, we present the first fully automated workflow for the diversification of WebAssembly binaries. We present CROW, an open-source tool implementing this workflow. We evaluate CROW's capabilities on 303 C programs and study its use on a real-life security-sensitive program: libsodium, a cryptographic library. Overall, CROW is able to generate diverse variants for 239 out of 303 (79%) small programs. Furthermore, our experiments show that our approach and tool are able to successfully diversify off-the-shelf cryptographic software (libsodium).
SE · Jan 21, 2020 · Code
A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem
César Soto-Valero, Nicolas Harrand, Martin Monperrus et al.
Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application's code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper, we propose an original study of one such challenge: the emergence of bloated dependencies. Bloated dependencies are libraries that the build tool packages with the application's compiled code but that are actually not necessary to build and run the application. This phenomenon artificially grows the size of the built binary and increases maintenance effort. We propose a tool, called DepClean, to analyze the presence of bloated dependencies in Maven artifacts. We analyze 9,639 Java artifacts hosted on Maven Central, which include a total of 723,444 dependency relationships. Our key result is that 75.1% of the analyzed dependency relationships are bloated. In other words, it is feasible to reduce the number of dependencies of Maven artifacts up to 1/4 of its current count. We also perform a qualitative study with 30 notable open-source projects. Our results indicate that developers pay attention to their dependencies and are willing to remove bloated dependencies: 18/21 answered pull requests were accepted and merged by developers, removing 131 dependencies in total.
SE · Dec 14, 2019 · Code
Automatic Observability for Dockerized Java Applications
Long Zhang, Deepika Tiwari, Brice Morin et al.
Docker is a virtualization technique heavily used in the industry to build cloud-based systems. In the context of Docker, a system is said to be observable if engineers can get accurate information about its running state in production. In this paper, we present a novel approach, called POBS, to automatically improve the observability of Dockerized Java applications. POBS is based on automated transformations of Docker configuration files. Our approach injects additional modules in the production application, in order to provide better observability. We evaluate POBS by applying it on open-source Java applications which are containerized with Docker. Our key result is that 148/170 (87%) of Docker Java containers can be automatically augmented with better observability.
SE · Oct 11, 2019 · Code
Repairnator patches programs automatically
Martin Monperrus, Simon Urli, Thomas Durieux et al.
Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce patches that were accepted by the human developers and permanently merged into the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair.
SE · Aug 19, 2019 · Code
The Strengths and Behavioral Quirks of Java Bytecode Decompilers
Nicolas Harrand, César Soto-Valero, Martin Monperrus et al.
During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code are not symmetric. Consequently, the decompilation process, which aims at producing source code from bytecode, must establish some strategies to reconstruct the information that has been lost. Modern Java decompilers tend to use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled have a direct impact on the quality of the source code produced by decompilers. We study the effectiveness of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. This study relies on a benchmark set of 14 real-world open-source software projects to be decompiled (2041 classes in total). Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. Even the highest ranking decompiler in this study produces syntactically correct output for 84% of classes of our dataset and semantically equivalent code output for 78% of classes.
SE · May 7, 2019 · Code
Explainable Software Bot Contributions: Case Study of Automated Bug Fixes
Martin Monperrus
In a software project, especially in open source, a contribution is a valuable piece of work made to the project: writing code, reporting bugs, translating, improving documentation, creating graphics, etc. We are now at the beginning of an exciting era where software bots will make contributions that are of a similar nature to those by humans. Dry contributions, with no explanation, are often ignored or rejected, because the contribution is not understandable per se, because they are not put into a larger context, because they are not grounded on idioms shared by the core community of developers. We have been operating a program repair bot called Repairnator for 2 years and noticed the problem of "dry patches": a patch that does not say which bug it fixes, or that does not explain the effects of the patch on the system. We envision program repair systems that produce an "explainable bug fix": an integrated package of at least 1) a patch, 2) its explanation in natural or controlled language, and 3) a highlight of the behavioral difference with examples. In this paper, we generalize and suggest that software bot contributions must be explainable, and that they must be put into the context of the global software development conversation.
SE · Apr 20, 2019 · Code
An Analysis of 35+ Million Jobs of Travis CI
Thomas Durieux, Rui Abreu, Martin Monperrus et al.
Travis CI automatically handles thousands of builds every day to, amongst other things, provide valuable feedback to thousands of open-source developers. In this paper, we investigate Travis CI to firstly understand who is using it, and when they start to use it. Secondly, we investigate how the developers use Travis CI and finally, how frequently the developers change the Travis CI configurations. We observed during our analysis that the main users of Travis CI are corporate users such as Microsoft. The programming languages used in Travis CI by those users do not follow the same popularity trend as on GitHub: for example, Python is the most popular language on Travis CI, but it is only the third one on GitHub. We also observe that Travis CI is set up on average seven days after the creation of the repository and the jobs are still mainly used (60%) to run tests. And finally, we observe that 7.34% of the commits modify the Travis CI configuration. We share the biggest benchmark of Travis CI jobs (to our knowledge): it contains 35,793,144 jobs from 272,917 different GitHub projects.
LG · Apr 5, 2019 · Code
A Literature Study of Embeddings on Source Code
Zimin Chen, Martin Monperrus
Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future.
SE · Feb 22, 2019 · Code
An Approach and Benchmark to Detect Behavioral Changes of Commits in Continuous Integration
Benjamin Danglot, Martin Monperrus, Walter Rudametkin et al.
When a developer pushes a change to an application's codebase, a good practice is to have a test case specifying this behavioral change. Thanks to continuous integration (CI), the test is run on subsequent commits to check that they do not introduce a regression for that behavior. In this paper, we propose an approach that detects behavioral changes in commits. As input, it takes a program, its test suite, and a commit. Its output is a set of test methods that capture the behavioral difference between the pre-commit and post-commit versions of the program. We call our approach DCI (Detecting behavioral changes in CI). It works by generating variations of the existing test cases through (i) assertion amplification and (ii) a search-based exploration of the input space. We evaluate our approach on a curated set of 60 commits from 6 open source Java projects. To our knowledge, this is the first ever curated dataset of real-world behavioral changes. Our evaluation shows that DCI is able to generate test methods that detect behavioral changes. Our approach is fully automated and can be integrated into current development processes. The main limitations are that it targets unit tests and works on a relatively small fraction of commits. More specifically, DCI works on commits that have a unit test that already executes the modified code. In practice, from our benchmark projects, we found 15.29% of commits to meet the conditions required by DCI.
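The core idea of capturing a behavioral difference between two program versions can be sketched as follows. This is a toy illustration, not DCI itself: DCI targets Java and amplifies existing test cases, whereas the functions `shipping_fee_pre` and `shipping_fee_post`, the fixed input list, and the helper names here are all hypothetical. Observed outputs of the pre-commit version are recorded as assertions, then replayed against the post-commit version; any assertion that no longer holds marks a behavioral change.

```python
from typing import Callable, List, Tuple


def shipping_fee_pre(weight_kg: int) -> int:
    """Hypothetical pre-commit version."""
    return 5 if weight_kg <= 10 else 12


def shipping_fee_post(weight_kg: int) -> int:
    """Hypothetical post-commit version: the heavy-parcel fee changed."""
    return 5 if weight_kg <= 10 else 15


def amplify_assertions(
    impl: Callable[[int], int], inputs: List[int]
) -> List[Tuple[int, int]]:
    """Record the observed output for each input as an (input, expected) assertion."""
    return [(x, impl(x)) for x in inputs]


def detect_behavioral_change(
    pre: Callable[[int], int], post: Callable[[int], int], inputs: List[int]
) -> List[Tuple[int, int, int]]:
    """Return (input, pre_output, post_output) for every assertion that
    holds on the pre-commit version but fails on the post-commit version."""
    amplified = amplify_assertions(pre, inputs)
    return [(x, expected, post(x)) for x, expected in amplified if post(x) != expected]


# A fixed input list stands in for the search-based input exploration.
changes = detect_behavioral_change(shipping_fee_pre, shipping_fee_post, [1, 10, 11, 50])
for x, before, after in changes:
    print(f"input={x}: pre-commit returned {before}, post-commit returned {after}")
```

Inputs that never reach the modified branch (here, weights of 10 kg or less) produce no difference, which mirrors the limitation stated in the abstract that an existing test must already execute the modified code.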
SE · Jan 17, 2019 · Code
Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies
Fernanda Madeiral, Simon Urli, Marcelo Maia et al.
Benchmarks of bugs are essential to empirically evaluate automatic program repair tools. In this paper, we present Bears, a project for collecting and storing bugs into an extensible bug benchmark for automatic repair studies in Java. The collection of bugs relies on commit building state from Continuous Integration (CI) to find potential pairs of buggy and patched program versions from open-source projects hosted on GitHub. Each pair of program versions passes through a pipeline where an attempt of reproducing a bug and its patch is performed. The core step of the reproduction pipeline is the execution of the test suite of the program on both program versions. If a test failure is found in the buggy program version candidate and no test failure is found in its patched program version candidate, a bug and its patch were successfully reproduced. The uniqueness of Bears is the usage of CI (builds) to identify buggy and patched program version candidates, which has been widely adopted in the last years in open-source projects. This approach allows us to collect bugs from a diversity of projects beyond mature projects that use bug tracking systems. Moreover, Bears was designed to be publicly available and to be easily extensible by the research community through automatic creation of branches with bugs in a given GitHub repository, which can be used for pull requests in the Bears repository. We present in this paper the approach employed by Bears, and we deliver the version 1.0 of Bears, which contains 251 reproducible bugs collected from 72 projects that use the Travis CI and Maven build environment.
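The reproduction criterion described above (at least one test failure on the buggy candidate, none on the patched candidate) can be written down as a small predicate. The `SuiteResult` type and the test names are hypothetical; the actual Bears pipeline builds and runs Java projects and involves more steps than this check.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SuiteResult:
    """Outcome of running a test suite on one program version (hypothetical type)."""
    failed_tests: FrozenSet[str]


def is_reproduced(buggy: SuiteResult, patched: SuiteResult) -> bool:
    """A bug and its patch count as reproduced when the buggy candidate has
    at least one failing test and the patched candidate has none."""
    return len(buggy.failed_tests) > 0 and len(patched.failed_tests) == 0


# Hypothetical candidate pairs.
fixed_pair = (SuiteResult(frozenset({"ParserTest.testEmptyInput"})), SuiteResult(frozenset()))
still_failing = (
    SuiteResult(frozenset({"ParserTest.testEmptyInput"})),
    SuiteResult(frozenset({"ParserTest.testEmptyInput"})),
)
no_failure = (SuiteResult(frozenset()), SuiteResult(frozenset()))

print(is_reproduced(*fixed_pair))
print(is_reproduced(*still_failing))
print(is_reproduced(*no_failure))
```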
SE · Dec 24, 2018 · Code
SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair
Zimin Chen, Steve Kommrusch, Michele Tufano et al.
This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 samples, carefully curated from commits to open-source repositories. We evaluate it on 4,711 independent real bug fixes, as well as on the Defects4J benchmark used in program repair research. SequenceR is able to perfectly predict the fixed line for 950/4711 testing samples, and find correct patches for 14 bugs in Defects4J. It captures a wide range of repair operators without any domain-specific top-down design.
SE · Dec 15, 2018 · Code
A Large-Scale Study of Call Graph-based Impact Prediction using Mutation Testing
Vincenzo Musco, Martin Monperrus, Philippe Preux
In software engineering, impact analysis involves predicting the software elements (e.g., modules, classes, methods) potentially impacted by a change in the source code. Impact analysis is required to optimize the testing effort. In this paper, we propose an evaluation technique to predict impact propagation. Based on 10 open-source Java projects and 5 classical mutation operators, we create 17,000 mutants and study how the error they introduce propagates. This evaluation technique enables us to analyze impact prediction based on four types of call graph. Our results show that graph sophistication increases the completeness of impact prediction. However, and surprisingly to us, the most basic call graph gives the best trade-off between precision and recall for impact prediction.
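As a rough illustration of call-graph-based impact prediction and its evaluation, the sketch below predicts the impact of a change as the set of transitive callers of the changed method, then scores that prediction against an observed impact set using precision and recall. The toy call graph, the method names, and the observed set are hypothetical; the paper's four call graph types and its mutation-based ground truth are not reproduced here.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def predicted_impact(call_edges: Iterable[Tuple[str, str]], changed: str) -> Set[str]:
    """Changed method plus all of its transitive callers (BFS on reversed edges)."""
    callers = defaultdict(set)
    for caller, callee in call_edges:
        callers[callee].add(caller)
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the predicted impact set against the observed one."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical call graph as (caller, callee) edges.
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("e", "f")]

# Suppose a mutant is injected in method "c".
predicted = predicted_impact(edges, "c")

# Hypothetical observed impact: "e" is affected through a path the
# call graph misses, while "a" and "d" turn out to be unaffected.
actual = {"c", "b", "e"}

precision, recall = precision_recall(predicted, actual)
print(sorted(predicted), round(precision, 2), round(recall, 2))
```

A more sophisticated graph would add edges and therefore grow the predicted set, which tends to raise recall (completeness) at the cost of precision; this is the trade-off the abstract reports on.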
SENov 24, 2018Code
How to Design a Program Repair Bot? Insights from the Repairnator ProjectSimon Urli, Zhongxing Yu, Lionel Seinturier et al.
Program repair research has made tremendous progress over the last few years, and software development bots are now being invented to help developers gain productivity. In this paper, we investigate the concept of a "program repair bot" and present Repairnator. The Repairnator bot is an autonomous agent that constantly monitors test failures, reproduces bugs, and runs program repair tools against each reproduced bug. If a patch is found, Repairnator bot reports it to the developers. At the time of writing, Repairnator uses three different program repair systems and has been operating since February 2017. In total, it has studied 11,317 test failures over 1,609 open-source software projects hosted on GitHub, and has generated patches for 17 different bugs. Over months, we hit a number of hard technical challenges and had to make various design and engineering decisions. This gives us a unique experience in this area. In this paper, we reflect upon Repairnator in order to share this knowledge with the automatic program repair community.
SENov 20, 2018Code
Automatic Test Improvement with DSpot: a Study with Ten Mature Open-Source ProjectsBenjamin Danglot, Oscar Luis Vera-Pérez, Benoit Baudry et al.
In the literature, there is a rather clear segregation between manually written tests by developers and automatically generated ones. In this paper, we explore a third solution: to automatically improve existing test cases written by developers. We present the concept, design, and implementation of a system called DSpot, that takes developer-written test cases as input (JUnit tests in Java) and synthesizes improved versions of them as output. Those test improvements are given back to developers as patches or pull requests, that can be directly integrated in the main branch of the test code base. We have evaluated DSpot in a deep, systematic manner over 40 real-world unit test classes from 10 notable and open-source software projects. We have amplified all test methods from those 40 unit test classes. In 26/40 cases, DSpot is able to automatically improve the test under study, by triggering new behaviors and adding new valuable assertions. Next, for ten projects under consideration, we have proposed a test improvement automatically synthesized by DSpot to the lead developers. In total, 13/19 proposed test improvements were accepted by the developers and merged into the main code base. This shows that DSpot is capable of automatically improving unit-tests in real-world, large-scale Java software.
SENov 10, 2018Code
Nopol: Automatic Repair of Conditional Statement Bugs in Java ProgramsJifeng Xuan, Matias Martinez, Favio Demarco et al.
We propose NOPOL, an approach to automatic repair of buggy conditional statements (i.e., if-then-else statements). This approach takes a buggy program as well as a test suite as input and generates a patch with a conditional expression as output. The test suite is required to contain passing test cases to model the expected behavior of the program and at least one failing test case that reveals the bug to be repaired. The process of NOPOL consists of three major phases. First, NOPOL employs angelic fix localization to identify expected values of a condition during the test execution. Second, runtime trace collection is used to collect variables and their actual values, including primitive data types and object-oriented features (e.g., nullness checks), to serve as building blocks for patch generation. Third, NOPOL encodes these collected data into an instance of a Satisfiability Modulo Theory (SMT) problem, then a feasible solution to the SMT instance is translated back into a code patch. We evaluate NOPOL on 22 real-world bugs (16 bugs with buggy IF conditions and 6 bugs with missing preconditions) on two large open-source projects, namely Apache Commons Math and Apache Commons Lang. Empirical analysis on these bugs shows that our approach can effectively fix bugs with buggy IF conditions and missing preconditions. We illustrate the capabilities and limitations of NOPOL using case studies of real bug fixes.
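To make the third phase more concrete, here is a deliberately simplified sketch of condition synthesis from collected traces. It replaces the SMT encoding described above with brute-force enumeration over comparisons of two integer variables, so it is not NOPOL's actual algorithm; the trace format, the operator set, and the `synthesize_condition` name are assumptions for illustration only.

```python
import operator
from itertools import permutations

# Candidate comparison operators (an assumed, very small search space).
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def synthesize_condition(traces):
    """Find `left op right` that matches the expected condition value in every trace.

    `traces` is a list of (env, expected) pairs: `env` maps variable names to
    the integer values collected at runtime, and `expected` is the angelic
    (desired) value of the condition for that execution.
    """
    variables = sorted(traces[0][0])
    for left, right in permutations(variables, 2):
        for symbol, op in OPS.items():
            if all(op(env[left], env[right]) == expected for env, expected in traces):
                return f"{left} {symbol} {right}"
    return None  # no candidate in this small search space fits the traces


if __name__ == "__main__":
    traces = [
        ({"index": 0, "size": 3}, True),
        ({"index": 3, "size": 3}, False),
        ({"index": 5, "size": 3}, False),
    ]
    print(synthesize_condition(traces))  # index < size
```

An SMT-based formulation explores a far richer space of expressions (constants, arithmetic, nullness checks) than this enumeration, which only serves to show how expected condition values and collected variable values jointly constrain the patch.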
SENov 7, 2018Code
Descartes: A PITest Engine to Detect Pseudo-Tested Methods - Tool DemonstrationOscar Luis Vera-Pérez, Martin Monperrus, Benoit Baudry
Descartes is a tool that implements extreme mutation operators and aims at finding pseudo-tested methods in Java projects. It leverages the efficient transformation and runtime features of PIT. The demonstration compares Descartes with Gregor, the default mutation engine provided by PIT, in a set of real open source projects. It considers the execution time, number of mutants created and the relationship between the mutation scores produced by both engines. It provides some insights on the main features exposed by Descartes.
SEOct 19, 2018Code
Coming: a Tool for Mining Change Pattern Instances from Git CommitsMatias Martinez, Martin Monperrus
Software repositories such as Git have become a relevant source of information for software engineering researchers. For instance, the detection of commits that fulfill a given criterion (e.g., bugfixing commits) is one of the most frequent tasks done to understand software evolution. However, to our knowledge, there are no open-source tools that, given a Git repository, return all the instances of a given change pattern. In this paper we present Coming, a tool that takes as input a Git repository and mines instances of change patterns on each commit. For that, Coming computes fine-grained changes between two consecutive revisions, analyzes those changes to detect if they correspond to an instance of a change pattern (specified by the user using XML), and finally, after analyzing all the commits, it presents a) the frequency of code changes and b) the instances found on each commit. We evaluate Coming on a set of 28 pairs of revisions from Defects4J, finding instances of change patterns that involve If conditions on 26 of them.
SEOct 13, 2018Code
Human-competitive Patches in Automatic Program Repair with RepairnatorMartin Monperrus, Simon Urli, Thomas Durieux et al.
Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce 5 patches that were accepted by the human developers and permanently merged in the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair.
SEOct 3, 2018Code
FixMiner: Mining Relevant Fix Patterns for Automated Program RepairAnil Koyuncu, Kui Liu, Tegawendé F. Bissyandé et al.
Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is thus to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. Our technique, FixMiner, leverages Rich Edit Script, a specialized tree structure of the edit scripts that captures the AST-level context of the code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes. These are abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns into an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner's generated plausible patches are correct.
SEJul 6, 2018Code
The CodRep Machine Learning on Source Code CompetitionZimin Chen, Martin Monperrus
CodRep is a machine learning competition on source code data. It is carefully designed so that anybody can enter the competition, whether professional researchers, students or independent scholars, without specific knowledge in machine learning or program analysis. In particular, it aims at being a common playground on which the machine learning and the software engineering research communities can interact. The competition started on April 14th, 2018 and ended on October 14th, 2018. The CodRep data is hosted at https://github.com/KTH/CodRep-competition/.
SEMay 4, 2018Code
Characterizing the Usage, Evolution and Impact of Java Annotations in PracticeZhongxing Yu, Chenggang Bai, Lionel Seinturier et al.
Annotations have been formally introduced into Java since Java 5. Since then, annotations have been widely used by the Java community for different purposes, such as compiler guidance and runtime processing. Despite the ever-growing use, there is still limited empirical knowledge about the actual usage of annotations in practice, the changes made to annotations during software evolution, and the potential impact of annotations on code quality. To fill this gap, we perform the first large-scale empirical study about Java annotations on 1,094 notable open-source projects hosted on GitHub. Our study systematically investigates annotation usage, annotation evolution, and annotation impact, and generates 10 novel and important findings. We also present the implications of our findings, which shed light for developers, researchers, tool builders, and language or library designers in order to improve all facets of Java annotation engineering.
SEOct 25, 2017Code
Exhaustive Exploration of the Failure-oblivious Computing Search SpaceThomas Durieux, Youssef Hamadi, Zhongxing Yu et al.
High availability of software systems requires automated handling of crashes in the presence of errors. Failure-oblivious computing is one technique that aims to achieve high availability. We note that failure-obliviousness has not been studied in depth yet, and there are very few studies that help understand why failure-oblivious techniques work. For failure-oblivious computing to have an impact in practice, we need to deeply understand failure-oblivious behaviors in software. In this paper, we study, design and perform an experiment that analyzes the size and the diversity of the failure-oblivious behaviors. Our experiment consists of exhaustively computing the search space of 16 field failures of large-scale open-source Java software. The outcome of this experiment is a much better understanding of what really happens when failure-oblivious computing is used, and this opens new promising research directions.
SEJul 15, 2017Code
Sorting and Transforming Program Repair Ingredients via Deep Learning Code SimilaritiesMartin White, Michele Tufano, Matias Martinez et al.
In the field of automated program repair, the redundancy assumption claims large programs contain the seeds of their own repair. However, most redundancy-based program repair techniques do not reason about the repair ingredients, i.e., the code that is reused to craft a patch. We aim to reason about the repair ingredients by using code similarities to prioritize and transform statements in a codebase for patch generation. Our approach, DeepRepair, relies on deep learning to reason about code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity to suspicious elements (i.e., code elements that contain suspicious statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined these new search strategies for patch generation with respect to effectiveness from the viewpoint of a software maintainer. Our comparative experiments were executed on six open-source Java projects including 374 buggy program revisions and consisted of 19,949 trials spanning 2,616 days of computation time. DeepRepair's search strategy using code similarities generally found compilable ingredients faster than the baseline, jGenProg, but this improvement neither yielded test-adequate patches in fewer attempts (on average) nor found significantly more patches than the baseline. Although the patch counts were not statistically different, there were notable differences between the nature of DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot be found by existing redundancy-based repair techniques.
SESep 22, 2016Code
Production-Driven Patch Generation and ValidationThomas Durieux, Youssef Hamadi, Martin Monperrus
We envision a world where the developer would receive each morning in her GitHub dashboard a list of potential patches that fix certain production failures. For this, we propose a novel program repair scheme, with the unique feature of being applicable to production directly. We present the design and implementation of a prototype system for Java, called Itzal, that performs patch generation for uncaught exceptions in production. We have performed two empirical experiments to validate our system: the first one on 34 failures from 14 different software applications, the second one on 16 seeded failures in 3 real open-source e-commerce applications for which we have set up a realistic user traffic. This validates the novel and disruptive idea of using program repair directly in production.
SESep 1, 2015Code
Automatic Software Diversity in the Light of Test SuitesBenoit Baudry, Simon Allier, Marcelino Rodriguez-Cancio et al.
A few works address the challenge of automating software diversification, and they all share one core idea: using automated test suites to drive diversification. However, there is a lack of solid understanding of how test suites, programs and transformations interact with one another in this process. We explore this intricate interplay in the context of a specific diversification technique called "sosiefication". Sosiefication generates sosie programs, i.e., variants of a program in which some statements are deleted, added or replaced but still pass the test suite of the original program. Our investigation of the influence of test suites on sosiefication exploits the following observation: test suites cover the different regions of programs in very unequal ways. Hence, we hypothesize that sosie synthesis has different performances on a statement that is covered by one hundred test cases and on a statement that is covered by a single test case. We synthesize 24583 sosies on 6 popular open-source Java programs. Our results show that there are two dimensions for diversification. The first one lies in the specification: the more test cases cover a statement, the more difficult it is to synthesize sosies. Yet, to our surprise, we are also able to synthesize sosies on highly tested statements (up to 600 test cases), which indicates an intrinsic property of the programs we study. The second dimension is in the code: we manually explore dozens of sosies and characterize new types of forgiving code regions that are prone to diversification.
SEMar 19, 2015Code
DSpot: Test Amplification for Automatic Assessment of Computational DiversityBenoit Baudry, Simon Allier, Marcelino Rodriguez-Cancio et al.
Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security. Objective: We aim at proposing an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test-suite) and (iii) exhibiting observable differences when they run outside the specified input space. Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain. Results: We run our experiments on 472 variants of 7 classes from open-source, large and thoroughly tested Java classes. Our test amplification multiplies by ten the number of input points in the test suite and is effective at detecting software diversity. Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; the behavioral diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.).
SEFeb 6, 2015Code
Casper: Debugging Null Dereferences with Dynamic Causality TracesBenoit Cornu, Earl T. Barr, Lionel Seinturier et al.
Fixing a software error requires understanding its root cause. In this paper, we introduce "causality traces", crafted execution traces augmented with the information needed to reconstruct the causal chain from the root cause of a bug to an execution error. We propose an approach and a tool, called Casper, for dynamically constructing causality traces for null dereference errors. The core idea of Casper is to inject special values, called "ghosts", into the execution stream to construct the causality trace at runtime. We evaluate our contribution by providing and assessing the causality traces of 14 real null dereference bugs collected over six large, popular open-source projects. Over this data set, Casper builds a causality trace in less than 5 seconds.
SEOct 24, 2014Code
ASTOR: Evolutionary Automatic Software Repair for JavaMatias Martinez, Martin Monperrus
Context: In recent years, many automatic software repair approaches have been presented by the software engineering research community. According to the corresponding papers, these approaches are able to repair real defects from open source projects. Problem: Some previous publications in the automatic repair field do not provide the implementation of their approaches. Consequently, it is not possible for the research community to re-execute the original evaluation, to set up new evaluations (for example, to evaluate the performance against new defects) or to compare approaches against each other. Solution: We propose a publicly available automatic software repair tool called Astor. It implements three state-of-the-art automatic software repair approaches in the context of Java programs (including GenProg and a subset of PAR's templates). The source code of Astor is licensed under the GNU General Public Licence (GPL v2).
SESep 10, 2014Code
Test Case Purification for Improving Fault LocalizationJifeng Xuan, Martin Monperrus
Finding and fixing bugs are time-consuming activities in software development. Spectrum-based fault localization aims to identify the faulty position in source code based on the execution trace of test cases. Failing test cases and their assertions form test oracles for the failing behavior of the system under analysis. In this paper, we propose a novel concept of spectrum-driven test case purification for improving fault localization. The goal of test case purification is to separate existing test cases into small fractions (called purified test cases) and to enhance the test oracles to further localize faults. Combined with an original fault localization technique (e.g., Tarantula), test case purification results in a better ranking of the program statements. Our experiments on 1800 faults in six open-source Java programs show that test case purification can effectively improve existing fault localization techniques.
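For readers unfamiliar with the ranking step, the following sketch computes the standard Tarantula suspiciousness score from a coverage spectrum and ranks statements by it. It covers only the baseline fault localization technique named in the abstract, not the purification step itself; the statement labels and coverage counts are made up for the example.

```python
def tarantula(failed_cov, passed_cov, total_failed, total_passed):
    """Tarantula suspiciousness of one statement.

    failed_cov / passed_cov: number of failing / passing tests covering it.
    """
    fail_ratio = failed_cov / total_failed if total_failed else 0.0
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    denominator = fail_ratio + pass_ratio
    return fail_ratio / denominator if denominator else 0.0


def rank(spectrum, total_failed, total_passed):
    """Sort statements from most to least suspicious."""
    scores = {
        stmt: tarantula(f, p, total_failed, total_passed)
        for stmt, (f, p) in spectrum.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # statement -> (covering failing tests, covering passing tests); toy data
    spectrum = {"s1": (1, 4), "s2": (1, 1), "s3": (0, 3)}
    for stmt, score in rank(spectrum, total_failed=1, total_passed=4):
        print(stmt, round(score, 2))
```

With one failing and four passing tests, `s2` ranks first (score 0.8) because it is covered by the failing test but by only one passing test. Splitting a failing test into smaller purified tests changes these coverage counts, which is how purification can alter the ranking.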
SEJan 29, 2014Code
Tailored Source Code Transformations to Synthesize Computationally Diverse Program VariantsBenoit Baudry, Simon Allier, Martin Monperrus
The predictability of program execution provides attackers with a rich source of knowledge, which they can exploit to spy on or remotely control the program. Moving target defense addresses this issue by constantly switching between many diverse variants of a program, which reduces the certainty that an attacker can have about the program execution. The effectiveness of this approach relies on the availability of a large number of software variants that exhibit different executions. However, current approaches rely on the natural diversity provided by off-the-shelf components, which is very limited. In this paper, we explore the automatic synthesis of large sets of program variants, called sosies. Sosies provide the same expected functionality as the original program, while exhibiting different executions. They are said to be computationally diverse. This work addresses two objectives: comparing different transformations for increasing the likelihood of sosie synthesis (densifying the search space for sosies); demonstrating computation diversity in synthesized sosies. We synthesized 30184 sosies in total, for 9 large, real-world, open source applications. For all these programs we identified one type of program analysis that systematically increases the density of sosies; we measured computation diversity for sosies of 3 programs and found diversity in method calls or data in more than 40% of sosies. This is a step towards controlled massive unpredictability of software.
SESep 15, 2013Code
Automatically Extracting Instances of Code Change Patterns with AST AnalysisMatias Martinez, Laurence Duchien, Martin Monperrus
A code change pattern represents a kind of recurrent modification in software. For instance, a known code change pattern consists of the change of the conditional expression of an if statement. Previous work has identified different change patterns. Complementary to the identification and definition of change patterns, the automatic extraction of pattern instances is essential to measure their empirical importance. For example, it enables one to count and compare the number of conditional expression changes in the history of different projects. In this paper we present a novel approach for searching for pattern instances in software history. Our technique is based on the analysis of the Abstract Syntax Trees (ASTs) of the files within a given commit. We validate our approach by counting instances of 18 change patterns in 6 open-source Java projects.
CRNov 4, 2025
PoCo: Agentic Proof-of-Concept Exploit Generation for Smart ContractsVivi Andersson, Sofia Bobadilla, Harald Hobbelhagen et al.
Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce POCO, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. POCO autonomously generates PoC exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate POCO on a dataset of 23 real-world vulnerability reports. POCO consistently outperforms the prompting and workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides readily actionable knowledge for the smart contract security community.
SEDec 25, 2023
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program RepairAndré Silva, Sen Fang, Martin Monperrus
Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and does not scale to frontier models. To address this problem, we propose RepairLLaMA, a novel program repair approach that 1) identifies optimal code representations for APR with fine-tuned models, and 2) pioneers a state-of-the-art parameter-efficient fine-tuning technique (PEFT) for program repair. This results in RepairLLaMA producing a highly effective "program repair adapter" for fixing bugs with AI. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals and produce better patches. Second, parameter-efficient fine-tuning helps fine-tuning to converge and clearly contributes to the effectiveness of RepairLLaMA in fixing bugs outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 144 Defects4J v2, 109 HumanEval-Java, and 20 GitBug-Java bugs, outperforming all baselines.
CRNov 15, 2025
Software Supply Chain Security of Web3Martin Monperrus
Web3 applications, built on blockchain technology, manage billions of dollars in digital assets through decentralized applications (dApps) and smart contracts. These systems rely on complex software supply chains that introduce significant security vulnerabilities. This paper examines the software supply chain security challenges unique to the Web3 ecosystem, where traditional Web2 software supply chain problems intersect with the immutable and high-stakes nature of blockchain technology. We analyze the threat landscape and propose mitigation strategies to strengthen the security posture of Web3 systems.
SEFeb 9, 2024
CigaR: Cost-efficient Program Repair with LLMsDávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla et al.
Large language models (LLMs) have proven to be effective at automated program repair (APR). However, using LLMs can be costly, with companies invoicing users by the number of tokens. In this paper, we propose CigaR, the first LLM-based APR tool that focuses on minimizing the repair cost. CigaR works in two major steps: generating a first plausible patch and multiplying plausible patches. CigaR optimizes the prompts and the prompt setting to maximize the information given to LLMs using the smallest possible number of tokens. Our experiments on 429 bugs from the widely used Defects4J and HumanEval-Java datasets show that CigaR reduces the token cost by 73%. On average, CigaR spends 127k tokens per bug while the baseline uses 467k tokens per bug. On the subset of bugs that are fixed by both, CigaR spends 20k tokens per bug while the baseline uses 608k tokens, a cost saving of 96%. Our extensive experiments show that CigaR is a cost-effective LLM-based program repair tool that uses a low number of tokens to automatically generate patches.
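The reported savings follow directly from the per-bug token counts. The short sketch below recomputes them; the `token_saving` helper is a hypothetical name, and the inputs are the rounded figures quoted in the abstract, so the results agree with the reported percentages only up to that rounding.

```python
def token_saving(baseline_tokens, tool_tokens):
    """Relative reduction in tokens compared with the baseline."""
    return 1 - tool_tokens / baseline_tokens


if __name__ == "__main__":
    # All bugs: 127k tokens per bug versus 467k for the baseline.
    print(f"{token_saving(467_000, 127_000):.1%}")
    # Bugs fixed by both tools: 20k tokens per bug versus 608k.
    print(f"{token_saving(608_000, 20_000):.1%}")
```

The first ratio is roughly 0.73, in line with the 73% reduction stated above. The second comes out between 0.96 and 0.97 with these rounded counts, which is consistent with the reported 96% saving.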