SEOct 18, 2023
Enhancing Genetic Improvement Mutations Using Large Language ModelsAlexander E. I. Brownlee, James Callan, Karine Even-Mendoza et al.
Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.
SEOct 24, 2025
GreenMalloc: Allocator Optimisation for Industrial WorkloadsAidan Dakhama, W. B. Langdon, Hector D. Menendez et al.
We present GreenMalloc, a multi objective search-based framework for automatically configuring memory allocators. Our approach uses NSGA II and rand_malloc as a lightweight proxy benchmarking tool. We efficiently explore allocator parameters from execution traces and transfer the best configurations to gem5, a large system simulator, in a case study on two allocators: the GNU C/CPP compiler's glibc malloc and Google's TCMalloc. Across diverse workloads, our empirical results show up to 4.1 percantage reduction in average heap usage without loss of runtime efficiency; indeed, we get a 0.25 percantage reduction.
10.8QUANT-PHMar 30
Toward Live Noise Fingerprinting in Quantum Software EngineeringAvner Bensoussan, Elena Chachkarova, Karine Even-Mendoza et al.
Contemporary quantum computers are inherently noisy, posing significant challenges for the development and testing of quantum software. Simplified or outdated noise assumptions can lead to incorrect assessments of program correctness, obscure debugging, and hinder cross-platform portability, creating a critical quantum software development gap. Providing accurate, practical noise characterisation is challenging as traditional reconstruction methods scale exponentially and rapidly become outdated. In this vision paper, we address this gap via a novel classical shadow tomography-based pipeline, SIMSHADOW, enabling efficient, continuously updatable noise fingerprinting from empirical observations, suitable for integration into quantum software development workflows, including testing and validation. We prototyped the pipeline to investigate fingerprints' ability to capture structured, interpretable noise and cross-platform discrepancies affecting quantum programs' behaviour to support realistic testing and debugging in future tools. Our evaluation with Qiskit and Cirq under widely used hardware-informed profiles, IBM Boston and Quantinuum H2, shows fingerprints exhibit channel-specific structure and yield interpretable heatmaps. We observed systematic cross-platform discrepancies under matched noise configurations, quantified by large Frobenius distances at a fraction of full tomography cost. On 69 MQTBENCH programs, larger fingerprint differences correlate with output distributions divergences, highlighting threats for testing and cross-platform debugging tasks.
QUANT-PHJun 20, 2025
Accelerating Quantum Eigensolver Algorithms With Machine LearningAvner Bensoussan, Elena Chachkarova, Karine Even-Mendoza et al.
In this paper, we explore accelerating Hamiltonian ground state energy calculation on NISQ devices. We suggest using search-based methods together with machine learning to accelerate quantum algorithms, exemplified in the Quantum Eigensolver use case. We trained two small models on classically mined data from systems with up to 16 qubits, using XGBoost's Python regressor. We evaluated our preliminary approach on 20-, 24- and 28-qubit systems by optimising the Eigensolver's hyperparameters. These models predict hyperparameter values, leading to a 0.12% reduction in error when tested on 28-qubit systems. However, due to inconclusive results with 20- and 24-qubit systems, we suggest further examination of the training data based on Hamiltonian characteristics. In future work, we plan to train machine learning models to optimise other aspects or subroutines of quantum algorithm execution beyond its hyperparameters.
55.0SEApr 29
Hot Fixing in the WildCarol Hanna, Karine Even-Mendoza, W. B. Langdon et al.
Despite the operational importance of hot fixes, large-scale evidence on how they reshape routine maintenance workflows, particularly in the era of autonomous coding agents, remains limited. We analyse hot fixes present in over 61,000 GitHub repositories from the Hao-Li/AIDev dataset and find consistent patterns of urgency: reduced collaboration (typically a single contributor), smaller and more targeted changes (median 2-3 commits and files, with <10 line modifications), limited review (often fewer than two reviewers), and substantially fewer test file modifications than regular bug fixes, consistent with their urgency-driven character. Leveraging the same urgency contexts, we examine differences between human- and AI-agent-authored hot fixes, revealing over 10 distinct repair behaviours, thus offering insights into future human-automation collaboration for hot fixing. Our study is the first to empirically analyse hot fix code changes at scale using a repository-level operationalisation of urgency. The comparison of human and agentbehaviours delineates their distinct characteristics, providing a foundation for understanding hot fixing in real-world practice
SEOct 5, 2025
GA4GC: Greener Agent for Greener Code via Multi-Objective Configuration OptimizationJingzhi Gong, Yixin Bian, Luis de la Cal et al.
Coding agents powered by LLMs face critical sustainability and scalability challenges in industrial deployment, with single runs consuming over 100k tokens and incurring environmental costs that may exceed optimization benefits. This paper introduces GA4GC, the first framework to systematically optimize coding agent runtime (greener agent) and code performance (greener code) trade-offs by discovering Pareto-optimal agent hyperparameters and prompt templates. Evaluation on the SWE-Perf benchmark demonstrates up to 135x hypervolume improvement, reducing agent runtime by 37.7% while improving correctness. Our findings establish temperature as the most critical hyperparameter, and provide actionable strategies to balance agent sustainability with code optimization effectiveness in industrial deployment.