David Williams

SE
h-index6
12papers
79citations
Novelty39%
AI Score51

12 Papers

SEMar 11Code
Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

David Williams, Ioakim Avraam, Aldeida Aleti et al.

Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. To this end, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.

SEMay 7
Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Giovanni Pinna, Jingzhi Gong, David Williams et al.

The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).

CLSep 25, 2024
Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

Rafael Mendoza, Isabella Cruz, Richard Liu et al.

Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.

SEMay 8
SafeTune: Search-based Harmfulness Minimisation for Large Language Models

Giordano d'Aloisio, David Williams, Giusy Annunziata et al.

The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.

SEMar 31
BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg

Serkan Kirbas, Federica Sarro, David Williams

As software in industry grows in size and complexity, so does the volume of engineering data that companies generate and use. Ideally, this data could be used for many purposes, including informing decisions on engineering priorities. However, without a structured representation of the links between different aspects of software development, companies can struggle to identify the root causes of deficiencies or anticipate the effects of changes. In this paper, we report on our experience at Bloomberg in developing a novel tool, dubbed BayesInsights, which provides an interactive interface for visualising causal dependencies across various aspects of the software engineering (SE) process using Bayesian Networks (BNs). We describe our journey from defining network structures using a combination of established literature, expert insight, and structure learning algorithms, to integrating BayesInsights into existing data analytics solutions, and conclude with a mixed-methods evaluation of performance benchmarking and survey responses from 24 senior practitioners at Bloomberg. Our results revealed 95.8% of participants found the tool useful for identifying software delivery challenges at the team and organisational levels, cementing its value as a proof of concept for modelling software delivery and developer experience. BayesInsights is currently in preview, with access granted to seven engineering teams and a wider deployment roadmap in place for the future.

AIJan 17, 2025
Exploring the Impact of Generative Artificial Intelligence in Education: A Thematic Analysis

Abhishek Kaushik, Sargam Yadav, Andrew Browne et al.

The recent advancements in Generative Artificial intelligence (GenAI) technology have been transformative for the field of education. Large Language Models (LLMs) such as ChatGPT and Bard can be leveraged to automate boilerplate tasks, create content for personalised teaching, and handle repetitive tasks to allow more time for creative thinking. However, it is important to develop guidelines, policies, and assessment methods in the education sector to ensure the responsible integration of these tools. In this article, thematic analysis has been performed on seven essays obtained from professionals in the education sector to understand the advantages and pitfalls of using GenAI models such as ChatGPT and Bard in education. Exploratory Data Analysis (EDA) has been performed on the essays to extract further insights from the text. The study found several themes which highlight benefits and drawbacks of GenAI tools, as well as suggestions to overcome these limitations and ensure that students are using these tools in a responsible and ethical manner.

SEOct 5, 2025
GA4GC: Greener Agent for Greener Code via Multi-Objective Configuration Optimization

Jingzhi Gong, Yixin Bian, Luis de la Cal et al.

Coding agents powered by LLMs face critical sustainability and scalability challenges in industrial deployment, with single runs consuming over 100k tokens and incurring environmental costs that may exceed optimization benefits. This paper introduces GA4GC, the first framework to systematically optimize coding agent runtime (greener agent) and code performance (greener code) trade-offs by discovering Pareto-optimal agent hyperparameters and prompt templates. Evaluation on the SWE-Perf benchmark demonstrates up to 135x hypervolume improvement, reducing agent runtime by 37.7% while improving correctness. Our findings establish temperature as the most critical hyperparameter, and provide actionable strategies to balance agent sustainability with code optimization effectiveness in industrial deployment.

CVMar 1, 2021
Fool Me Once: Robust Selective Segmentation via Out-of-Distribution Detection with Contrastive Learning

David Williams, Matthew Gadd, Daniele De Martini et al.

In this work, we train a network to simultaneously perform segmentation and pixel-wise Out-of-Distribution (OoD) detection, such that the segmentation of unknown regions of scenes can be rejected. This is made possible by leveraging an OoD dataset with a novel contrastive objective and data augmentation scheme. By combining data including unknown classes in the training data, a more robust feature representation can be learned with known classes represented distinctly from those unknown. When presented with unknown classes or conditions, many current approaches for segmentation frequently exhibit high confidence in their inaccurate segmentations and cannot be trusted in many operational environments. We validate our system on a real-world dataset of unusual driving scenes, and show that by selectively segmenting scenes based on what is predicted as OoD, we can increase the segmentation accuracy by an IoU of 0.2 with respect to alternative techniques.

ROMay 11, 2020
Keep off the Grass: Permissible Driving Routes from Radar with Weak Audio Supervision

David Williams, Daniele De Martini, Matthew Gadd et al.

Reliable outdoor deployment of mobile robots requires the robust identification of permissible driving routes in a given environment. The performance of LiDAR and vision-based perception systems deteriorates significantly if certain environmental factors are present e.g. rain, fog, darkness. Perception systems based on FMCW scanning radar maintain full performance regardless of environmental conditions and with a longer range than alternative sensors. Learning to segment a radar scan based on driveability in a fully supervised manner is not feasible as labelling each radar scan on a bin-by-bin basis is both difficult and time-consuming to do by hand. We therefore weakly supervise the training of the radar-based classifier through an audio-based classifier that is able to predict the terrain type underneath the robot. By combining odometry, GPS and the terrain labels from the audio classifier, we are able to construct a terrain labelled trajectory of the robot in the environment which is then used to label the radar scans. Using a curriculum learning procedure, we then train a radar segmentation network to generalise beyond the initial labelling and to detect all permissible driving routes in the environment.

IVSep 13, 2019
Coupling Rendering and Generative Adversarial Networks for Artificial SAS Image Generation

Albert Reed, Isaac Gerg, John McKay et al.

Acquisition of Synthetic Aperture Sonar (SAS) datasets is bottlenecked by the costly deployment of SAS imaging systems, and even when data acquisition is possible,the data is often skewed towards containing barren seafloor rather than objects of interest. We present a novel pipeline, called SAS GAN, which couples an optical renderer with a generative adversarial network (GAN) to synthesize realistic SAS images of targets on the seafloor. This coupling enables high levels of SAS image realism while enabling control over image geometry and parameters. We demonstrate qualitative results by presenting examples of images created with our pipeline. We also present quantitative results through the use of t-SNE and the Fréchet Inception Distance to argue that our generated SAS imagery potentially augments SAS datasets more effectively than an off-the-shelf GAN.

CVAug 8, 2018
Additional Representations for Improving Synthetic Aperture Sonar Classification Using Convolutional Neural Networks

Isaac Gerg, David Williams

Object classification in synthetic aperture sonar (SAS) imagery is usually a data starved and class imbalanced problem. There are few objects of interest present among much benign seafloor. Despite these problems, current classification techniques discard a large portion of the collected SAS information. In particular, a beamformed SAS image, which we call a single-look complex (SLC) image, contains complex pixels composed of real and imaginary parts. For human consumption, the SLC is converted to a magnitude-phase representation and the phase information is discarded. Even more problematic, the magnitude information usually exhibits a large dynamic range (>80dB) and must be dynamic range compressed for human display. Often it is this dynamic range compressed representation, originally designed for human consumption, which is fed into a classifier. Consequently, the classification process is completely void of the phase information. In this work, we show improvements in classification performance using the phase information from the SLC as well as information from an alternate source: photographs. We perform statistical testing to demonstrate the validity of our results.

SEJun 25, 2014
Managing LTL properties in Event-B refinement

Steve Schneider, Helen Treharne, Heike Wehrheim et al.

Refinement in Event-B supports the development of systems via proof based step-wise refinement of events. This refinement approach ensures safety properties are preserved, but additional reasoning is required in order to establish liveness and fairness properties. In this paper we present results which allow a closer integration of two formal methods, Event-B and linear temporal logic. In particular we show how a class of temporal logic properties can carry through a refinement chain of machines. Refinement steps can include introduction of new events, event renaming and event splitting. We also identify a general liveness property that holds for the events of the initial system of a refinement chain. The approach will aid developers in enabling them to verify linear temporal logic properties at early stages of a development, knowing they will be preserved at later stages. We illustrate the results via a simple case study.