75.4SEMay 26
EviACT: An Evidence-to-Action Framework for Agentic Program RepairQianru Meng, Xiao Zhang, Zhaochun Ren et al.
LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still struggle to use execution evidence to guide localization, patch generation, and validation. We propose EviACT (Evidence-to-Action), an agentic APR framework that coordinates three evidence-driven guardrails across repair stages. The retrieval scaffold grounds repair context, the compile gate filters invalid edits, and the test-driven gate checks target-test recovery before full regression. Across four benchmarks, EviACT improves resolve rate over the strongest reported comparable baselines by 1.6-6.0 percentage points and shows 70.1-88.6% lower reported per-bug API cost where baseline costs are available. Ablations and diagnostics suggest that these gains are associated with the coordinated evidence-to-action chain, making agentic APR more effective and efficient.
47.3SEMar 24
When More Retrieval Hurts: Retrieval-Augmented Code Review GenerationQianru Meng, Xiao Zhang, Zhaochen Ren et al.
Code review generation can reduce developer effort by producing concise, reviewer-style feedback for a given code snippet or code change. However, generation-only models often produce generic or off-point reviews, while retrieval-only methods struggle to adapt well to new contexts. In this paper, we view retrieval augmentation for code review as retrieval-augmented in-context learning, where retrieved historical reviews are placed in the input context as examples that guide the model's output. Based on this view, we propose RARe (Retrieval-Augmented Code Reviewer), a framework that retrieves relevant historical reviews from a corpus and conditions a large language model on the retrieved in-context examples. Experiments on two public benchmarks show that RARe outperforms strong baselines and reaches BLEU-4 scores of 12.32 and 12.96. A key finding is that more retrieval can hurt: using only the top-1 retrieved example works best, while adding more retrieved items can degrade performance due to redundancy and conflicting cues under limited context budgets. Human evaluation and interpretability analysis further support that retrieval-augmented generation reduces generic outputs and improves review focus.
SEAug 11, 2020Code
GraphRepo: Fast Exploration in Software Repository MiningAlex Serban, Magiel Bruntink, Joost Visser
Mining and storage of data from software repositories is typically done on a per-project basis, where each project uses a unique combination of data schema, extraction tools, and (intermediate) storage infrastructure. We introduce GraphRepo, a tool that enables a unified approach to extract data from Git repositories, store it, and share it across repository mining projects. GraphRepo usesNeo4j, an ACID-compliant graph database management system, and allows modular plug-in of components for repository extraction (drillers), analysis (miners), and export (mappers). The graph enables a natural way to query the data by removing the need for data normalisation. GraphRepo is built in Python and offers multiple ways to interface with the rich Python ecosystem and with big data solutions. The schema of the graph database is generic and extensible. Using GraphRepo for software repository mining offers several advantages versus creating project-specific infrastructure: (i) high performance for short-iteration exploration and scalability to large data sets (ii) easy distribution of extracted data(e.g., for replication) or sharing of extracted data among projects, and (iii) extensibility and interoperability. A set of benchmarks on four open source projects demonstrate that GraphRepo allows very fast querying of repository data, once extracted and indexed. More information can be found in the project's documentation (available at https://tinyurl.com/grepodoc) and in the project's repository (available at https://tinyurl.com/grrepo). A video demonstration isalso available online (https://tinyurl.com/grrepov)
LGMay 26, 2021
Deep Repulsive Prototypes for Adversarial RobustnessAlex Serban, Erik Poll, Joost Visser
While many defences against adversarial examples have been proposed, finding robust machine learning models is still an open problem. The most compelling defence to date is adversarial training and consists of complementing the training data set with adversarial examples. Yet adversarial training severely impacts training time and depends on finding representative adversarial samples. In this paper we propose to train models on output spaces with large class separation in order to gain robustness without adversarial training. We introduce a method to partition the output space into class prototypes with large separation and train models to preserve it. Experimental results shows that models trained with these prototypes -- which we call deep repulsive prototypes -- gain robustness competitive with adversarial training, while also preserving more accuracy on natural samples. Moreover, the models are more resilient to large perturbation sizes. For example, we obtained over 50% robustness for CIFAR-10, with 92% accuracy on natural samples and over 20% robustness for CIFAR-100, with 71% accuracy on natural samples without adversarial training. For both data sets, the models preserved robustness against large perturbations better than adversarially trained models.
SEMay 26, 2021
Adapting Software Architectures to Machine Learning ChallengesAlex Serban, Joost Visser
Unique developmental and operational characteristics of ML components as well as their inherent uncertainty demand robust engineering principles are used to ensure their quality. We aim to determine how software systems can be (re-) architected to enable robust integration of ML components. Towards this goal, we conducted a mixed-methods empirical study consisting of (i) a systematic literature review to identify the challenges and their solutions in software architecture for ML, (ii) semi-structured interviews with practitioners to qualitatively complement the initial findings and (iii) a survey to quantitatively validate the challenges and their solutions. We compiled and validated twenty challenges and solutions for (re-) architecting systems with ML components. Our results indicate, for example, that traditional software architecture challenges (e.g., component coupling) also play an important role when using ML components; along with new ML specific challenges (e.g., the need for continuous retraining). Moreover, the results indicate that ML heightened decision drivers, such as privacy, play a marginal role compared to traditional decision drivers, such as scalability. Using the survey we were able to establish a link between architectural solutions and software quality attributes, which enabled us to provide twenty architectural tactics used to satisfy individual quality requirements of systems with ML components. Altogether, the results of the study can be interpreted as an empirical framework that supports the process of (re-) architecting software systems with ML components.
SEMar 1, 2021
Practices for Engineering Trustworthy Machine Learning ApplicationsAlex Serban, Koen van der Blom, Holger Hoos et al.
Following the recent surge in adoption of machine learning (ML), the negative impact that improper use of ML can have on users and society is now also widely recognised. To address this issue, policy makers and other stakeholders, such as the European Commission or NIST, have proposed high-level guidelines aiming to promote trustworthy ML (i.e., lawful, ethical and robust). However, these guidelines do not specify actions to be taken by those involved in building ML systems. In this paper, we argue that guidelines related to the development of trustworthy ML can be translated to operational practices, and should become part of the ML development life cycle. Towards this goal, we ran a multi-vocal literature review, and mined operational practices from white and grey literature. Moreover, we launched a global survey to measure practice adoption and the effects of these practices. In total, we identified 14 new practices, and used them to complement an existing catalogue of ML engineering practices. Initial analysis of the survey results reveals that so far, practice adoption for trustworthy ML is relatively low. In particular, practices related to assuring security of ML components have very low adoption. Other practices enjoy slightly larger adoption, such as providing explanations to users. Our extended practice catalogue can be used by ML development teams to bridge the gap between high-level guidelines and actual development of trustworthy ML systems; it is open for review and contribution
LGAug 12, 2020
Learning to Learn from Mistakes: Robust Optimization for Adversarial NoiseAlex Serban, Erik Poll, Joost Visser
Sensitivity to adversarial noise hinders deployment of machine learning algorithms in security-critical applications. Although many adversarial defenses have been proposed, robustness to adversarial noise remains an open problem. The most compelling defense, adversarial training, requires a substantial increase in processing time and it has been shown to overfit on the training data. In this paper, we aim to overcome these limitations by training robust models in low data regimes and transfer adversarial knowledge between different models. We train a meta-optimizer which learns to robustly optimize a model using adversarial examples and is able to transfer the knowledge learned to new models, without the need to generate new adversarial examples. Experimental results show the meta-optimizer is consistent across different architectures and data sets, suggesting it is possible to automatically patch adversarial vulnerabilities.
SEAug 7, 2020
Towards Using Probabilistic Models to Design Software Systems with Inherent UncertaintyAlex Serban, Erik Poll, Joost Visser
The adoption of machine learning (ML) components in software systems raises new engineering challenges. In particular, the inherent uncertainty regarding functional suitability and the operation environment makes architecture evaluation and trade-off analysis difficult. We propose a software architecture evaluation method called Modeling Uncertainty During Design (MUDD) that explicitly models the uncertainty associated to ML components and evaluates how it propagates through a system. The method supports reasoning over how architectural patterns can mitigate uncertainty and enables comparison of different architectures focused on the interplay between ML and classical software components. While our approach is domain-agnostic and suitable for any system where uncertainty plays a central role, we demonstrate our approach using as example a perception system for autonomous driving.
CVAug 7, 2020
Adversarial Examples on Object Recognition: A Comprehensive SurveyAlex Serban, Erik Poll, Joost Visser
Deep neural networks are at the forefront of machine learning research. However, despite achieving impressive performance on complex tasks, they can be very sensitive: Small perturbations of inputs can be sufficient to induce incorrect behavior. Such perturbations, called adversarial examples, are intentionally designed to test the network's sensitivity to distribution drifts. Given their surprisingly small size, a wide body of literature conjectures on their existence and how this phenomenon can be mitigated. In this article we discuss the impact of adversarial examples on security, safety, and robustness of neural networks. We start by introducing the hypotheses behind their existence, the methods used to construct or protect against them, and the capacity to transfer adversarial examples between different machine learning models. Altogether, the goal is to provide a comprehensive and self-contained survey of this growing field of research.
SEJul 28, 2020
Adoption and Effects of Software Engineering Best Practices in Machine LearningAlex Serban, Koen van der Blom, Holger Hoos et al.
The increasing reliance on applications with machine learning (ML) components calls for mature engineering techniques that ensure these are built in a robust and future-proof manner. We aim to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components. We mined both academic and grey literature and identified 29 engineering best practices for ML applications. We conducted a survey among 313 practitioners to determine the degree of adoption for these practices and to validate their perceived effects. Using the survey responses, we quantified practice adoption, differentiated along demographic characteristics, such as geography or team size. We also tested correlations and investigated linear and non-linear relationships between practices and their perceived effect using various statistical models. Our findings indicate, for example, that larger teams tend to adopt more practices, and that traditional software engineering practices tend to have lower adoption than ML specific practices. Also, the statistical models can accurately predict perceived effects such as agility, software quality and traceability, from the degree of adoption for specific sets of practices. Combining practice adoption rates with practice importance, as revealed by statistical models, we identify practices that are important but have low adoption, as well as practices that are widely adopted but are less important for the effects we studied. Overall, our survey and the analysis of responses received provide a quantitative basis for assessment and step-wise improvement of practice adoption by ML teams.
CVFeb 7, 2019
StampNet: unsupervised multi-class object discoveryJoost Visser, Alessandro Corbetta, Vlado Menkovski et al.
Unsupervised object discovery in images involves uncovering recurring patterns that define objects and discriminates them against the background. This is more challenging than image clustering as the size and the location of the objects are not known: this adds additional degrees of freedom and increases the problem complexity. In this work, we propose StampNet, a novel autoencoding neural network that localizes shapes (objects) over a simple background in images and categorizes them simultaneously. StampNet consists of a discrete latent space that is used to categorize objects and to determine the location of the objects. The object categories are formed during the training, resulting in the discovery of a fixed set of objects. We present a set of experiments that demonstrate that StampNet is able to localize and cluster multiple overlapping shapes with varying complexity including the digits from the MNIST dataset. We also present an application of StampNet in the localization of pedestrians in overhead depth-maps.
CVOct 2, 2018
Adversarial Examples - A Complete Characterisation of the PhenomenonAlexandru Constantin Serban, Erik Poll, Joost Visser
We provide a complete characterisation of the phenomenon of adversarial examples - inputs intentionally crafted to fool machine learning models. We aim to cover all the important concerns in this field of study: (1) the conjectures on the existence of adversarial examples, (2) the security, safety and robustness implications, (3) the methods used to generate and (4) protect against adversarial examples and (5) the ability of adversarial examples to transfer between different machine learning models. We provide ample background information in an effort to make this document self-contained. Therefore, this document can be used as survey, tutorial or as a catalog of attacks and defences using adversarial examples.
SEJan 22, 2017
The Influence of Teamwork Quality on Software Team PerformanceEmily Weimar, Ariadi Nugroho, Joost Visser et al.
Traditionally, software quality is thought to depend on sound software engineering and development methodologies such as structured programming and agile development. However, high quality software depends just as much on high quality collaboration within the team. Since the success rate of software development projects is low (Wateridge, 1995; The Standish Group, 2009), it is important to understand which characteristics of interactions within software development teams significantly influence performance. Hoegl and Gemuenden (2001) reported empirical evidence for the relation between teamwork quality and software quality, using a six-factor teamwork quality (TWQ) model. This article extends the work of Hoegl and Gemuenden (2001) with the aim of finding additional factors that may influence software team performance. We introduce three new TWQ factors: trust, value sharing, and coordination of expertise. The relationship between TWQ and team performance and the improvement of the model are tested using data from 252 team members and stakeholders. Results show that teamwork quality is significantly related to team performance, as rated by both team members and stakeholders: TWQ explains 81% of the variance of team performance as rated by team members and 61% as rated by stakeholders. This study shows that trust, shared values, and coordination of expertise are important factors for team leaders to consider in order to achieve high quality software team work.