Justyna Petke

8papers

91citations

Novelty36%

AI Score41

Ranked #89,748 of 201,326 authors (top 45%)#1,090 in SE (top 32%)

8 Papers

SEMar 11Code

Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

David Williams, Ioakim Avraam, Aldeida Aleti et al.

Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. To this end, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.

SEAug 4, 2022

MAGPIE: Machine Automated General Performance Improvement via Evolution of Software

Aymeric Blot, Justyna Petke

Performance is one of the most important qualities of software. Several techniques have thus been proposed to improve it, such as program transformations, optimisation of software parameters, or compiler flags. Many automated software improvement approaches use similar search strategies to explore the space of possible improvements, yet available tooling only focuses on one approach at a time. This makes comparisons and exploration of interactions of the various types of improvement impractical. We propose MAGPIE, a unified software improvement framework. It provides a common edit sequence based representation that isolates the search process from the specific improvement technique, enabling a much simplified synergistic workflow. We provide a case study using a basic local search to compare compiler optimisation, algorithm configuration, and genetic improvement. We chose running time as our efficiency measure and evaluated our approach on four real-world software, written in C, C++, and Java. Our results show that, used independently, all techniques find significant running time improvements: up to 25% for compiler optimisation, 97% for algorithm configuration, and 61% for evolving source code using genetic improvement. We also show that up to 10% further increase in performance can be obtained with partial combinations of the variants found by the different techniques. Furthermore, the common representation also enables simultaneous exploration of all techniques, providing a competitive alternative to using each technique individually.

SEOct 18, 2023

Enhancing Genetic Improvement Mutations Using Large Language Models

Alexander E. I. Brownlee, James Callan, Karine Even-Mendoza et al.

Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.

SEApr 29

Hot Fixing in the Wild

Carol Hanna, Karine Even-Mendoza, W. B. Langdon et al.

Despite the operational importance of hot fixes, large-scale evidence on how they reshape routine maintenance workflows, particularly in the era of autonomous coding agents, remains limited. We analyse hot fixes present in over 61,000 GitHub repositories from the Hao-Li/AIDev dataset and find consistent patterns of urgency: reduced collaboration (typically a single contributor), smaller and more targeted changes (median 2-3 commits and files, with <10 line modifications), limited review (often fewer than two reviewers), and substantially fewer test file modifications than regular bug fixes, consistent with their urgency-driven character. Leveraging the same urgency contexts, we examine differences between human- and AI-agent-authored hot fixes, revealing over 10 distinct repair behaviours, thus offering insights into future human-automation collaboration for hot fixing. Our study is the first to empirically analyse hot fix code changes at scale using a repository-level operationalisation of urgency. The comparison of human and agentbehaviours delineates their distinct characteristics, providing a foundation for understanding hot fixing in real-world practice

SEAug 27, 2021

HyperGI: Automated Detection and Repair of Information Flow Leakage

Ibrahim Mesecan, Daniel Blackwell, David Clark et al.

Maintaining confidential information control in software is a persistent security problem where failure means secrets can be revealed via program behaviors. Information flow control techniques traditionally have been based on static or symbolic analyses -- limited in scalability and specialized to particular languages. When programs do leak secrets there are no approaches to automatically repair them unless the leak causes a functional test to fail. We present our vision for HyperGI, a genetic improvement framework tha detects, localizes and repairs information leakage. Key elements of HyperGI include (1) the use of two orthogonal test suites, (2) a dynamic leak detection approach which estimates and localizes potential leaks, and (3) a repair component that produces a candidate patch using genetic improvement. We demonstrate the successful use of HyperGI on several programs which have no failing functional tests. We manually examine the resulting patches and identify trade-offs and future directions for fully realizing our vision.

SEJul 31, 2020

Genetic Improvement @ ICSE 2020

William B. Langdon, Westley Weimer, Justyna Petke et al.

Following Prof. Mark Harman of Facebook's keynote and formal presentations (which are recorded in the proceedings) there was a wide ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take up, human factors, explainabiloity (explainability, justifyability, exploitability) and GI benchmarks. We also contrast various recent online approaches (e.g. SBST 2020) to holding virtual computer science conferences and workshops via the WWW on the Internet without face-2-face interaction. Finally we speculate on how the Coronavirus Covid-19 Pandemic will affect research next year and into the future.

SEAug 7, 2019

A Survey of Constrained Combinatorial Testing

Huayao Wu, Changhai Nie, Justyna Petke et al.

Combinatorial Testing (CT) is a potentially powerful testing technique, whereas its failure revealing ability might be dramatically reduced if it fails to handle constraints in an adequate and efficient manner. To ensure the wider applicability of CT in the presence of constrained problem domains, large and diverse efforts have been invested towards the techniques and applications of constrained combinatorial testing. In this paper, we provide a comprehensive survey of representations, influences, and techniques that pertain to constraints in CT, covering 129 papers published between 1987 and 2018. This survey not only categorises the various constraint handling techniques, but also reviews comparatively less well-studied, yet potentially important, constraint identification and maintenance techniques. Since real-world programs are usually constrained, this survey can be of interest to researchers and practitioners who are looking to use and study constrained combinatorial testing techniques.

AIJan 18, 2014

Local Consistency and SAT-Solvers

Peter Jeavons, Justyna Petke

Local consistency techniques such as k-consistency are a key component of specialised solvers for constraint satisfaction problems. In this paper we show that the power of using k-consistency techniques on a constraint satisfaction problem is precisely captured by using a particular inference rule, which we call negative-hyper-resolution, on the standard direct encoding of the problem into Boolean clauses. We also show that current clause-learning SAT-solvers will discover in expected polynomial time any inconsistency that can be deduced from a given set of clauses using negative-hyper-resolvents of a fixed size. We combine these two results to show that, without being explicitly designed to do so, current clause-learning SAT-solvers efficiently simulate k-consistency techniques, for all fixed values of k. We then give some experimental results to show that this feature allows clause-learning SAT-solvers to efficiently solve certain families of constraint problems which are challenging for conventional constraint-programming solvers.