75.8SEMay 7
Evaluating Non-English Developer Support in Machine Learning for Software EngineeringJonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu et al.
Large Language Models are increasingly used in software engineering, but both code generation and its evaluation remain predominantly English-centric. This leaves a major gap in our understanding of how well current tools support multilingual development, where code contains non-English natural language. In this paper, we investigate non-English code comment generation and the reliability of current methods for evaluating such outputs. We evaluate five code LLMs (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Dutch, English, Greek, Polish and Chinese. We further conduct an open-coding study of 12,500 generated comments, from which we derive a publicly released human-annotated dataset and a taxonomy of 26 error types. We use these human annotations, to evaluate the performance of neural metrics, and LLM-as-a-judge pipelines. Our findings show that generative performance deteriorates substantially outside English, with linguistic errors increasing by up to 15.1$\times$, alongside frequent incoherent generations and a rise in semantic errors. More critically, we show that detecting errors in non-English comments underperforms. Across classical overlap-based metrics, off-the-shelf neural metrics, extended neural metrics using newer multilingual, language-specific, and code-specific models, and LLM-as-a-judge pipelines, no automatic approach provides reliable and consistent assessment. Neural metrics fail to distinguish correct comments from incorrect outputs or even random noise, and tend to overestimate quality in non-English settings. LLM-as-a-judge methods achieve the highest agreement with human annotations but fail to reliably capture important language-related and semantic errors. Overall, our results show that evaluation and generation are key barriers for multilingual tooling, and that human judgment remains indispensable.
CRJan 4, 2023
Extending Source Code Pre-Trained Language Models to Summarise Decompiled BinariesAli Al-Kaswan, Toufique Ahmed, Maliheh Izadi et al.
Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help Reverse Engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components; the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend future research to further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.
LGAug 16, 2023Code
Endogenous Macrodynamics in Algorithmic RecoursePatrick Altmeyer, Giovan Angela, Aleksander Buszydlik et al.
Existing work on Counterfactual Explanations (CE) and Algorithmic Recourse (AR) has largely focused on single individuals in a static environment: given some estimated model, the goal is to find valid counterfactuals for an individual instance that fulfill various desiderata. The ability of such counterfactuals to handle dynamics like data and model drift remains a largely unexplored research challenge. There has also been surprisingly little work on the related question of how the actual implementation of recourse by one individual may affect other individuals. Through this work, we aim to close that gap. We first show that many of the existing methodologies can be collectively described by a generalized framework. We then argue that the existing framework does not account for a hidden external cost of recourse, that only reveals itself when studying the endogenous dynamics of recourse at the group level. Through simulation experiments involving various state-of the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and study the resulting domain and model shifts. We find that the induced shifts are substantial enough to likely impede the applicability of Algorithmic Recourse in some situations. Fortunately, we find various strategies to mitigate these concerns. Our simulation framework for studying recourse dynamics is fast and opensourced.
SEMar 25, 2022
Code Smells for Machine Learning ApplicationsHaiyin Zhang, Luís Cruz, Arie van Deursen
The popularity of machine learning has wildly expanded in recent years. Machine learning techniques have been heatedly studied in academia and applied in the industry to create business value. However, there is a lack of guidelines for code quality in machine learning applications. In particular, code smells have rarely been studied in this domain. Although machine learning code is usually integrated as a small part of an overarching system, it usually plays an important role in its core functionality. Hence ensuring code quality is quintessential to avoid issues in the long run. This paper proposes and identifies a list of 22 machine learning-specific code smells collected from various sources, including papers, grey literature, GitHub commits, and Stack Overflow posts. We pinpoint each smell with a description of its context, potential issues in the long run, and proposed solutions. In addition, we link them to their respective pipeline stage and the evidence from both academic and grey literature. The code smell catalog helps data scientists and developers produce and maintain high-quality machine learning application code.
SEMar 15, 2022
Data Smells in Public DatasetsArumoy Shome, Luis Cruz, Arie van Deursen
The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling the data, yet tools to aid them with data analysis are lacking. This study identifies the recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.
CLApr 24, 2023
Enriching Source Code with Contextual Data for Code Completion Models: An Empirical StudyTim van Dam, Maliheh Izadi, Arie van Deursen
Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks including automatic code completion which is a staple in a developer's toolkit. While many have striven to improve the code-understanding abilities of such models, the opposite -- making the code easier to understand -- has not been properly investigated. In this study, we aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion. We consider type annotations and comments as two common forms of additional contextual information that often help developers understand code better. For the experiments, we study code completion in two granularity levels; token and line completion and take three recent and large-scale language models for source code: UniXcoder, CodeGPT, and InCoder with five evaluation metrics. Finally, we perform the Wilcoxon Signed Rank test to gauge significance and measure the effect size. Contrary to our expectations, all models perform better if type annotations are removed (albeit the effect sizes are small). For comments, we find that the models perform better in the presence of multi-line comments (again with small effect sizes). Based on our observations, we recommend making proper design choices when training, fine-tuning, or simply selecting such models given the intended data and application. Better evaluations and multi-modal techniques can also be further investigated to improve the practicality and accuracy of auto-completions.
SEFeb 25, 2023
STACC: Code Comment Classification using SentenceTransformersAli Al-Kaswan, Maliheh Izadi, Arie van Deursen
Code comments are a key resource for information about software artefacts. Depending on the use case, only some types of comments are useful. Thus, automatic approaches to classify these comments have been proposed. In this work, we address this need by proposing, STACC, a set of SentenceTransformers-based binary classifiers. These lightweight classifiers are trained and tested on the NLBSE Code Comment Classification tool competition dataset, and surpass the baseline by a significant margin, achieving an average F1 score of 0.74 against the baseline of 0.31, which is an improvement of 139%. A replication package, as well as the models themselves, are publicly available.
LGNov 23, 2022
Are Concept Drift Detectors Reliable Alarming Systems? -- A Comparative StudyLorena Poenaru-Olaru, Luis Cruz, Arie van Deursen et al.
As machine learning models increasingly replace traditional business logic in the production system, their lifecycle management is becoming a significant concern. Once deployed into production, the machine learning models are constantly evaluated on new streaming data. Given the continuous data flow, shifting data, also known as concept drift, is ubiquitous in such settings. Concept drift usually impacts the performance of machine learning models, thus, identifying the moment when concept drift occurs is required. Concept drift is identified through concept drift detectors. In this work, we assess the reliability of concept drift detectors to identify drift in time by exploring how late are they reporting drifts and how many false alarms are they signaling. We compare the performance of the most popular drift detectors belonging to two different concept drift detector groups, error rate-based detectors and data distribution-based detectors. We assess their performance on both synthetic and real-world data. In the case of synthetic data, we investigate the performance of detectors to identify two types of concept drift, abrupt and gradual. Our findings aim to help practitioners understand which drift detector should be employed in different situations and, to achieve this, we share a list of the most important observations made throughout this study, which can serve as guidelines for practical usage. Furthermore, based on our empirical results, we analyze the suitability of each concept drift detection group to be used as alarming system.
CLFeb 13, 2023
Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction ChallengeAli Al-Kaswan, Maliheh Izadi, Arie van Deursen
Previous work has shown that Large Language Models are susceptible to so-called data extraction attacks. This allows an attacker to extract a sample that was contained in the training data, which has massive privacy implications. The construction of data extraction attacks is challenging, current attacks are quite inefficient, and there exists a significant gap in the extraction capabilities of untargeted attacks and memorization. Thus, targeted attacks are proposed, which identify if a given sample from the training data, is extractable from a model. In this work, we apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge. We apply a two-step approach. In the first step, we maximise the recall of the model and are able to extract the suffix for 69% of the samples. In the second step, we use a classifier-based Membership Inference Attack on the generations. Our AutoSklearn classifier achieves a precision of 0.841. The full approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
SEAug 25, 2023
On the Impact of Language Selection for Training and Evaluating Programming Language ModelsJonathan Katzy, Maliheh Izadi, Arie van Deursen
The recent advancements in Transformer-based Language Models have demonstrated significant potential in enhancing the multilingual capabilities of these models. The remarkable progress made in this domain not only applies to natural language tasks but also extends to the domain of programming languages. Despite the ability of these models to learn from multiple languages, evaluations typically focus on particular combinations of the same languages. In this study, we evaluate the similarity of programming languages by analyzing their representations using a CodeBERT-based model. Our experiments reveal that token representation in languages such as C++, Python, and Java exhibit proximity to one another, whereas the same tokens in languages such as Mathematica and R display significant dissimilarity. Our findings suggest that this phenomenon can potentially result in performance challenges when dealing with diverse languages. Thus, we recommend using our similarity measure to select a diverse set of programming languages when training and evaluating future models.
SENov 1, 2022
An Empirical Study on Data Leakage and Generalizability of Link Prediction Models for Issues and CommitsMaliheh Izadi, Pooya Rostami Mazrae, Tom Mens et al.
To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches primarily focused on improving prediction accuracy on randomly-split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on both textual and metadata information of issues and commits. Next, to gauge the effect of time on model performance, we employ two splitting policies during both the training and testing phases; randomly- and temporally-split datasets. Finally, in pursuit of a generic model that can demonstrate high performance across a range of projects, we undertake additional fine-tuning of LinkFormer within two distinct transfer-learning settings. Our findings support that to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data when training models. Furthermore, the results demonstrate that LinkFormer outperforms existing methodologies by a significant margin, achieving a 48% improvement in F1-measure within a project-based setting. Finally, the performance of LinkFormer in the cross-project setting is comparable to its average performance within the project-based scenario.
16.6SEApr 15
Learning from Change: Predictive Models for Incident Prevention in a Regulated IT EnvironmentEileen Kapel, Jan Lennartz, Luis Cruz et al.
Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.
SEJan 15
Model See, Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMsAli Al-Kaswan, Claudio Spiess, Prem Devanbu et al.
Large language models are increasingly used for code generation and debugging, but their outputs can still contain bugs, that originate from training data. Distinguishing whether an LLM prefers correct code, or a familiar incorrect version might be influenced by what it's been exposed to during training. We introduce an exposure-aware evaluation framework that quantifies how prior exposure to buggy versus fixed code influences a model's preference. Using the ManySStuBs4J benchmark, we apply Data Portraits for membership testing on the Stack-V2 corpus to estimate whether each buggy and fixed variant was seen during training. We then stratify examples by exposure and compare model preference using code completion as well as multiple likelihood-based scoring metrics We find that most examples (67%) have neither variant in the training data, and when only one is present, fixes are more frequently present than bugs. In model generations, models reproduce buggy lines far more often than fixes, with bug-exposed examples amplifying this tendency and fix-exposed examples showing only marginal improvement. In likelihood scoring, minimum and maximum token-probability metrics consistently prefer the fixed code across all conditions, indicating a stable bias toward correct fixes. In contrast, metrics like the Gini coefficient reverse preference when only the buggy variant was seen. Our results indicate that exposure can skew bug-fix evaluations and highlight the risk that LLMs may propagate memorised errors in practice.
LGMar 24, 2023
Uncovering Energy-Efficient Practices in Deep Learning Training: Preliminary Steps Towards Green AITim Yarally, Luís Cruz, Daniel Feitosa et al.
Modern AI practices all strive towards the same goal: better results. In the context of deep learning, the term "results" often refers to the achieved accuracy on a competitive problem set. In this paper, we adopt an idea from the emerging field of Green AI to consider energy consumption as a metric of equal importance to accuracy and to reduce any irrelevant tasks or energy usage. We examine the training stage of the deep learning pipeline from a sustainability perspective, through the study of hyperparameter tuning strategies and the model complexity, two factors vastly impacting the overall pipeline's energy consumption. First, we investigate the effectiveness of grid search, random search and Bayesian optimisation during hyperparameter tuning, and we find that Bayesian optimisation significantly dominates the other strategies. Furthermore, we analyse the architecture of convolutional neural networks with the energy consumption of three prominent layer types: convolutional, linear and ReLU layers. The results show that convolutional layers are the most computationally expensive by a strong margin. Additionally, we observe diminishing returns in accuracy for more energy-hungry models. The overall energy consumption of training can be halved by reducing the network complexity. In conclusion, we highlight innovative and promising energy-efficient practices for training deep learning models. To expand the application of Green AI, we advocate for a shift in the design of deep learning models, by considering the trade-off between energy efficiency and accuracy.
LGAug 14, 2023
Explaining Black-Box Models through CounterfactualsPatrick Altmeyer, Arie van Deursen, Cynthia C. S. Liem
We present CounterfactualExplanations.jl: a package for generating Counterfactual Explanations (CE) and Algorithmic Recourse (AR) for black-box models in Julia. CE explain how inputs into a model need to change to yield specific model predictions. Explanations that involve realistic and actionable changes can be used to provide AR: a set of proposed actions for individuals to change an undesirable outcome for the better. In this article, we discuss the usefulness of CE for Explainable Artificial Intelligence and demonstrate the functionality of our package. The package is straightforward to use and designed with a focus on customization and extensibility. We envision it to one day be the go-to place for explaining arbitrary predictive models in Julia through a diverse suite of counterfactual generators.
SEFeb 25, 2024Code
Language Models for Code Completion: A Practical EvaluationMaliheh Izadi, Jonathan Katzy, Tim van Dam et al.
Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models' performance in online and offline settings was also performed, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers utilize code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the model's predictions, we found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.
LGNov 17, 2023
Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real WorldLorena Poenaru-Olaru, Natalia Karpova, Luis Cruz et al.
Anomaly detection techniques are essential in automating the monitoring of IT systems and operations. These techniques imply that machine learning algorithms are trained on operational data corresponding to a specific period of time and that they are continuously evaluated on newly emerging data. Operational data is constantly changing over time, which affects the performance of deployed anomaly detection models. Therefore, continuous model maintenance is required to preserve the performance of anomaly detectors over time. In this work, we analyze two different anomaly detection model maintenance techniques in terms of the model update frequency, namely blind model retraining and informed model retraining. We further investigate the effects of updating the model by retraining it on all the available data (full-history approach) and only the newest data (sliding window approach). Moreover, we investigate whether a data change monitoring tool is capable of determining when the anomaly detection model needs to be updated through retraining.
LGJul 21, 2023
Batching for Green AI -- An Exploratory Study on InferenceTim Yarally, Luís Cruz, Daniel Feitosa et al.
The batch size is an essential parameter to tune during the development of new neural networks. Amongst other quality indicators, it has a large degree of influence on the model's accuracy, generalisability, training times and parallelisability. This fact is generally known and commonly studied. However, during the application phase of a deep learning model, when the model is utilised by an end-user for inference, we find that there is a disregard for the potential benefits of introducing a batch size. In this study, we examine the effect of input batching on the energy consumption and response times of five fully-trained neural networks for computer vision that were considered state-of-the-art at the time of their publication. The results suggest that batching has a significant effect on both of these metrics. Furthermore, we present a timeline of the energy efficiency and accuracy of neural networks over the past decade. We find that in general, energy consumption rises at a much steeper pace than accuracy and question the necessity of this evolution. Additionally, we highlight one particular network, ShuffleNetV2(2018), that achieved a competitive performance for its time while maintaining a much lower energy consumption. Nevertheless, we highlight that the results are model dependent.
65.6AIApr 21Code
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag ChallengesAli Al-Kaswan, Maksim Plotnikov, Maxim Hájek et al.
Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
69.5LGApr 4
Automated Attention Pattern Discovery at Scale in Large Language ModelsJonathan Katzy, Razvan-Mihai Popescu, Erik Mekkes et al.
Large language models have found success by scaling up capabilities to work in general settings. The same can unfortunately not be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, through exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern - Masked Autoencoder(AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE also serves as a selection procedure to guide fine-grained mechanistic approaches. We release code and models to support future work in large-scale interpretability.
SEApr 2, 2025Code
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming TasksAli Al-Kaswan, Sebastian Deatc, Begüm Koç et al.
Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs. Furthermore, we investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices. On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.
LGMar 8, 2025Code
WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion ModelsAditya Shankar, Lydia Y. Chen, Arie van Deursen et al.
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
CLJan 16, 2025Code
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language ModelsJonathan Katzy, Razvan Mihai Popescu, Arie van Deursen et al.
The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
LGJan 22
Counterfactual Training: Teaching Models Plausible and Actionable ExplanationsPatrick Altmeyer, Aleksander Buszydlik, Arie van Deursen et al.
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
LGJun 17, 2024Code
Long Code Arena: a Set of Benchmarks for Long-Context Code ModelsEgor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov et al.
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
SEJan 20, 2022Code
"Project smells" -- Experiences in Analysing the Software Quality of ML Projects with mllintBart van Oort, Luís Cruz, Babak Loni et al.
Machine Learning (ML) projects incur novel challenges in their development and productionisation over traditional software applications, though established principles and best practices in ensuring the project's software quality still apply. While using static analysis to catch code smells has been shown to improve software quality attributes, it is only a small piece of the software quality puzzle, especially in the case of ML projects given their additional challenges and lower degree of Software Engineering (SE) experience in the data scientists that develop them. We introduce the novel concept of project smells which consider deficits in project management as a more holistic perspective on software quality in ML projects. An open-source static analysis tool mllint was also implemented to help detect and mitigate these. Our research evaluates this novel concept of project smells in the industrial context of ING, a global bank and large software- and data-intensive organisation. We also investigate the perceived importance of these project smells for proof-of-concept versus production-ready ML projects, as well as the perceived obstructions and benefits to using static analysis tools such as mllint. Our findings indicate a need for context-aware static analysis tools, that fit the needs of the project at its current stage of development, while requiring minimal configuration effort from the user.
SEMar 6, 2021Code
The Prevalence of Code Smells in Machine Learning projectsBart van Oort, Luís Cruz, Maurício Aniche et al.
Artificial Intelligence (AI) and Machine Learning (ML) are pervasive in the current computer science landscape. Yet, there still exists a lack of software engineering experience and best practices in this field. One such best practice, static code analysis, can be used to find code smells, i.e., (potential) defects in the source code, refactoring opportunities, and violations of common coding standards. Our research set out to discover the most prevalent code smells in ML projects. We gathered a dataset of 74 open-source ML projects, installed their dependencies and ran Pylint on them. This resulted in a top 20 of all detected code smells, per category. Manual analysis of these smells mainly showed that code duplication is widespread and that the PEP8 convention for identifier naming style may not always be applicable to ML code due to its resemblance with mathematical notation. More interestingly, however, we found several major obstructions to the maintainability and reproducibility of ML projects, primarily related to the dependency management of Python projects. We also found that Pylint cannot reliably check for correct usage of imported dependencies, including prominent ML libraries such as PyTorch.
SEMar 2, 2021Code
An Exploratory Study of Log Placement Recommendation in an Enterprise SystemJeanderson Cândido, Jan Haesen, Maurício Aniche et al.
Logging is a development practice that plays an important role in the operations and monitoring of complex systems. Developers place log statements in the source code and use log data to understand how the system behaves in production. Unfortunately, anticipating where to log during development is challenging. Previous studies show the feasibility of leveraging machine learning to recommend log placement despite the data imbalance since logging is a fraction of the overall code base. However, it remains unknown how those techniques apply to an industry setting, and little is known about the effect of imbalanced data and sampling techniques. In this paper, we study the log placement problem in the code base of Adyen, a large-scale payment company. We analyze 34,526 Java files and 309,527 methods that sum up +2M SLOC. We systematically measure the effectiveness of five models based on code metrics, explore the effect of sampling techniques, understand which features models consider to be relevant for the prediction, and evaluate whether we can exploit 388,086 methods from 29 Apache projects to learn where to log in an industry setting. Our best performing model achieves 79% of balanced accuracy, 81% of precision, 60% of recall. While sampling techniques improve recall, they penalize precision at a prohibitive cost. Experiments with open-source data yield under-performing models over Adyen's test set; nevertheless, they are useful due to their low rate of false positives. Our supporting scripts and tools are available to the community.
SEJan 13, 2020Code
Generating Class-Level Integration Tests Using Call Site InformationPouria Derakhshanfar, Xavier Devroey, Annibale Panichella et al.
Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated unit-tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose CLING, an integration-level test case generation approach that exploits how a pair of classes, the caller and the callee, interact with each other through method calls. In particular, CLING generates integration-level test cases that maximize the Coupled Branches Criterion (CBC). Coupled branches are pairs of branches containing a branch of the caller and a branch of the callee such that an integration test that exercises the former also exercises the latter. CBC is a novel integration-level coverage criterion, measuring the degree to which a test suite exercises the interactions between a caller and its callee classes. We implemented CLING and evaluated the approach on 140 pairs of classes from five different open-source Java projects. Our results show that (1) CLING generates test suites with high CBC coverage, thanks to the definition of the test suite generation as a many-objectives problem where each couple of branches is an independent objective; (2) such generated suites trigger different class interactions and can kill on average 7.7% (with a maximum of 50%) of mutants that are not detected by tests generated at the unit level; (3) CLING can detect integration faults coming from wrong assumptions about the usage of the callee class (32 for our subject systems) that remain undetected when using automatically generated unit-level test suites.
SEDec 12, 2019Code
Log-based software monitoring: a systematic mapping studyJeanderson Barros Cândido, Maurício Finavaro Aniche, Arie van Deursen
Modern software development and operations rely on monitoring to understand how systems behave in production. The data provided by application logs and runtime environment are essential to detect and diagnose undesired behavior and improve system reliability. However, despite the rich ecosystem around industry-ready log solutions, monitoring complex systems and getting insights from log data remains a challenge. Researchers and practitioners have been actively working to address several challenges related to logs, e.g., how to effectively provide better tooling support for logging decisions to developers, how to effectively process and store log data, and how to extract insights from log data. A holistic view of the research effort on logging practices and automated log analysis is key to provide directions and disseminate the state-of-the-art for technology transfer. In this paper, we study 108 papers (72 research track papers, 24 journals, and 12 industry track papers) from different communities (e.g., machine learning, software engineering, and systems) and structure the research field in light of the life-cycle of log data. Our analysis shows that (1) logging is challenging not only in open-source projects but also in industry, (2) machine learning is a promising approach to enable a contextual analysis of source code for log recommendation but further investigation is required to assess the usability of those tools in practice, (3) few studies approached efficient persistence of log data, and (4) there are open opportunities to analyze application logs and to evaluate state-of-the-art log analysis techniques in a DevOps context.
SEDec 10, 2019Code
Search-based Crash Reproduction using Behavioral Model SeedingPouria Derakhshanfar, Xavier Devroey, Gilles Perrouin et al.
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating objects needed to trigger the crash. One way to overcome this limitation is seeding: using information about the application during the search process. With seeding, the existing usages of classes can be used in the search process to produce realistic sequences of method calls which create the required objects. In this study, we introduce behavioral model seeding: a new seeding method which learns class usages from both the system under test and existing test cases. Learned usages are then synthesized in a behavioral model (state machine). Then, this model serves to guide the evolutionary process. To assess behavioral model-seeding, we evaluate it against test-seeding (the state-of-the-art technique for seeding realistic objects) and no-seeding (without seeding any class usage). For this evaluation, we use a benchmark of 124 hard-to-reproduce crashes stemming from six open-source projects. Our results indicate that behavioral model-seeding outperforms both test seeding and no-seeding by a minimum of 6% without any notable negative impact on efficiency.
CRDec 18, 2023
Traces of Memorisation in Large Language Models for CodeAli Al-Kaswan, Maliheh Izadi, Arie van Deursen
Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models, and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. From the training data that was identified to be potentially extractable we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more, as their parameter count grows, and that their pre-training data are also vulnerable to attack. We also find that data carriers are memorised at a higher rate than regular code or documentation and that different model architectures memorise different samples. Data leakage has severe outcomes, so we urge the research community to further investigate the extent of this phenomenon using a wider range of models and extraction techniques in order to build safeguards to mitigate this issue.
SEMar 22, 2024
An Exploratory Investigation into Code License Infringements in Large Language Model Training DatasetsJonathan Katzy, Răzvan-Mihai Popescu, Arie van Deursen et al.
Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.
SEMay 23, 2024
A Transformer-Based Approach for Smart Invocation of Automatic Code CompletionAral de Moor, Arie van Deursen, Maliheh Izadi
Transformer-based language models are highly effective for code completion, with much research dedicated to enhancing the content of these completions. Despite their effectiveness, these models come with high operational costs and can be intrusive, especially when they suggest too often and interrupt developers who are concentrating on their work. Current research largely overlooks how these models interact with developers in practice and neglects to address when a developer should receive completion suggestions. To tackle this issue, we developed a machine learning model that can accurately predict when to invoke a code completion tool given the code context and available telemetry data. To do so, we collect a dataset of 200k developer interactions with our cross-IDE code completion plugin and train several invocation filtering models. Our results indicate that our small-scale transformer model significantly outperforms the baseline while maintaining low enough latency. We further explore the search space for integrating additional telemetry data into a pre-trained transformer directly and obtain promising results. To further demonstrate our approach's practical potential, we deployed the model in an online environment with 34 developers and provided real-world insights based on 74k actual invocations.
LGJan 15, 2024
Data vs. Model Machine Learning Fairness Testing: An Empirical StudyArumoy Shome, Luis Cruz, Arie van Deursen
Although several fairness definitions and bias mitigation techniques exist in the literature, all existing solutions evaluate fairness of Machine Learning (ML) systems after the training stage. In this paper, we take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training. We evaluate the effectiveness of the proposed approach and position it within the ML development lifecycle, using an empirical analysis of the relationship between model dependent and independent fairness metrics. The study uses 2 fairness metrics, 4 ML algorithms, 5 real-world datasets and 1600 fairness evaluation cycles. We find a linear relationship between data and model fairness metrics when the distribution and the size of the training data changes. Our results indicate that testing for fairness prior to training can be a ``cheap'' and effective means of catching a biased data collection process early; detecting data drifts in production systems and minimising execution of full training cycles thus reducing development time and costs.
LGDec 17, 2023
Faithful Model Explanations through Energy-Constrained Conformal CounterfactualsPatrick Altmeyer, Mojtaba Farmanbar, Arie van Deursen et al.
Counterfactual explanations offer an intuitive and straightforward way to explain black-box models and offer algorithmic recourse to individuals. To address the need for plausible explanations, existing work has primarily relied on surrogate models to learn how the input data is distributed. This effectively reallocates the task of learning realistic explanations for the data from the model itself to the surrogate. Consequently, the generated explanations may seem plausible to humans but need not necessarily describe the behaviour of the black-box model faithfully. We formalise this notion of faithfulness through the introduction of a tailored evaluation metric and propose a novel algorithmic framework for generating Energy-Constrained Conformal Counterfactuals that are only as plausible as the model permits. Through extensive empirical studies, we demonstrate that ECCCo reconciles the need for faithfulness and plausibility. In particular, we show that for models with gradient access, it is possible to achieve state-of-the-art performance without the need for surrogate models. To do so, our framework relies solely on properties defining the black-box model itself by leveraging recent advances in energy-based modelling and conformal prediction. To our knowledge, this is the first venture in this direction for generating faithful counterfactual explanations. Thus, we anticipate that ECCCo can serve as a baseline for future research. We believe that our work opens avenues for researchers and practitioners seeking tools to better distinguish trustworthy from unreliable models.
SEMay 21, 2025
A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation MetricsJonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu et al.
Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2 across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment \textit{correctness} across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. They highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
SEOct 11, 2025
Prepared for the Unknown: Adapting AIOps Capacity Forecasting Models to Data ChangesLorena Poenaru-Olaru, Wouter van 't Hof, Adrian Stando et al.
Capacity management is critical for software organizations to allocate resources effectively and meet operational demands. An important step in capacity management is predicting future resource needs often relies on data-driven analytics and machine learning (ML) forecasting models, which require frequent retraining to stay relevant as data evolves. Continuously retraining the forecasting models can be expensive and difficult to scale, posing a challenge for engineering teams tasked with balancing accuracy and efficiency. Retraining only when the data changes appears to be a more computationally efficient alternative, but its impact on accuracy requires further investigation. In this work, we investigate the effects of retraining capacity forecasting models for time series based on detected changes in the data compared to periodic retraining. Our results show that drift-based retraining achieves comparable forecasting accuracy to periodic retraining in most cases, making it a cost-effective strategy. However, in cases where data is changing rapidly, periodic retraining is still preferred to maximize the forecasting accuracy. These findings offer actionable insights for software teams to enhance forecasting systems, reducing retraining overhead while maintaining robust performance.
SEAug 4, 2025
TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEsDaniele Cipollone, Egor Bogomolov, Arie van Deursen et al.
Token-level code completion is one of the most critical features in modern Integrated Development Environments (IDEs). It assists developers by suggesting relevant identifiers and APIs during coding. While completions are typically derived from static analysis, their usefulness depends heavily on how they are ranked, as correct predictions buried deep in the list are rarely seen by users. Most current systems rely on hand-crafted heuristics or lightweight machine learning models trained on user logs, which can be further improved to capture context information and generalize across projects and coding styles. In this work, we propose a new scoring approach to ranking static completions using language models in a lightweight and model-agnostic way. Our method organizes all valid completions into a prefix tree and performs a single greedy decoding pass to collect token-level scores across the tree. This enables a precise token-aware ranking without needing beam search, prompt engineering, or model adaptations. The approach is fast, architecture-agnostic, and compatible with already deployed models for code completion. These findings highlight a practical and effective pathway for integrating language models into already existing tools within IDEs, and ultimately providing smarter and more responsive developer assistance.
LGJun 16, 2025
Sustainable Machine Learning Retraining: Optimizing Energy Efficiency Without Compromising AccuracyLorena Poenaru-Olaru, June Sallou, Luis Cruz et al.
The reliability of machine learning (ML) software systems is heavily influenced by changes in data over time. For that reason, ML systems require regular maintenance, typically based on model retraining. However, retraining requires significant computational demand, which makes it energy-intensive and raises concerns about its environmental impact. To understand which retraining techniques should be considered when designing sustainable ML applications, in this work, we study the energy consumption of common retraining techniques. Since the accuracy of ML systems is also essential, we compare retraining techniques in terms of both energy efficiency and accuracy. We showcase that retraining with only the most recent data, compared to all available data, reduces energy consumption by up to 25\%, being a sustainable alternative to the status quo. Furthermore, our findings show that retraining a model only when there is evidence that updates are necessary, rather than on a fixed schedule, can reduce energy consumption by up to 40\%, provided a reliable data change detector is in place. Our findings pave the way for better recommendations for ML practitioners, guiding them toward more energy-efficient retraining techniques when designing sustainable ML software systems.
SEJan 25, 2024
McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps SolutionsLorena Poenaru-Olaru, Luis Cruz, Jan Rellermeyer et al.
Due to the continuous change in operational data, AIOps solutions suffer from performance degradation over time. Although periodic retraining is the state-of-the-art technique to preserve the failure prediction AIOps models' performance over time, this technique requires a considerable amount of labeled data to retrain. In AIOps obtaining label data is expensive since it requires the availability of domain experts to intensively annotate it. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment the AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotations with 30k for job failure prediction and 260k for disk failure prediction while achieving similar performance with periodic retraining.
SEFeb 4, 2022
Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at MicrosoftJiyang Zhang, Chandra Maddila, Ram Bairi et al.
Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well-accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects. To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. Fortunately, at Microsoft, we have a wealth of work artifacts across many repositories that can yield valuable information about our developers. To address the aforementioned problem, we present CORAL, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests (PRs), work items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories within Microsoft. We show that CORAL is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers who traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that "classical" reviewer recommendation systems perform better on smaller (in terms of developers) software projects while CORAL excels on larger projects, suggesting that there is "no one model to rule them all."
SEOct 15, 2021
Nalanda: A Socio-Technical Graph for Building Software Analytics Tools at Enterprise ScaleChandra Maddila, Suhas Shanbhogue, Apoorva Agrawal et al.
Software development is information-dense knowledge work that requires collaboration with other developers and awareness of artifacts such as work items, pull requests, and files. With the speed of development increasing, information overload is a challenge for people developing and maintaining these systems. Finding information and people is difficult for software engineers, especially when they work in large software systems or have just recently joined a project. In this paper, we build a large scale data platform named Nalanda platform, which contains two subsystems: 1. A large scale socio-technical graph system, named Nalanda graph system 2. A large scale recommendation system, named Nalanda index system that aims at satisfying the information needs of software developers. The Nalanda graph is an enterprise scale graph with data from 6,500 repositories, with 37,410,706 nodes and 128,745,590 edges. On top of the Nalanda graph system, we built software analytics applications including a newsfeed named MyNalanda, and based on organic growth alone, it has Daily Active Users (DAU) of 290 and Monthly Active Users (MAU) of 590. A preliminary user study shows that 74% of developers and engineering managers surveyed are favorable toward continued use of the platform for information discovery. The Nalanda index system constitutes two indices: artifact index and expert index. It uses the socio-technical graph (Nalanda graph system) to rank the results and provide better recommendations to software developers. A large scale quantitative evaluation shows that the Nalanda index system provides recommendations with an accuracy of 78% for the top three recommendations.
SEApr 8, 2021
Secure Software Engineering in the Financial Services: A Practitioners' PerspectiveVivek Arora, Enrique Larios Vargas, Maurício Aniche et al.
Secure software engineering is a fundamental activity in modern software development. However, while the field of security research has been advancing quite fast, in practice, there is still a vast knowledge gap between the security experts and the software development teams. After all, we cannot expect developers and other software practitioners to be security experts. Understanding how software development teams incorporate security in their processes and the challenges they face is a step towards reducing this gap. In this paper, we study how financial services companies ensure the security of their software systems. To that aim, we performed a qualitative study based on semi-structured interviews with 16 software practitioners from 11 different financial companies in three continents. Our results shed light on the security considerations that practitioners take during the different phases of their software development processes, the different security practices that software teams make use of to ensure the security of their software systems, the improvements that practitioners perceive as important in existing state-of-the-practice security tools, the different knowledge-sharing and learning practices that developers use to learn more about software security, and the challenges that software practitioners currently face when it comes to secure their systems.
SEJan 16, 2021
ConE: A Concurrent Edit Detection Tool for Large Scale Software DevelopmentChandra Maddila, Nachiappan Nagappan, Christian Bird et al.
Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, which may result in multiple developers concurrently editing the same code areas. This, in turn, may lead to hard-to-merge changes or even merge conflicts, logical bugs that are difficult to detect, duplication of work, and wasted developer productivity. To address this, we explore the extent of this problem in the pull request based software development model. We study half a year of changes made to six large repositories in Microsoft in which at least 1,000 pull requests are created each month. We find that files concurrently edited in different pull requests are more likely to introduce bugs. Motivated by these findings, we design, implement, and deploy a service named ConE (Concurrent Edit Detector) that proactively detects pull requests containing concurrent edits, to help mitigate the problems caused by them. ConE has been designed to scale, and to minimize false alarms while still flagging relevant concurrently edited files. Key concepts of ConE include the detection of the Extent of Overlap between pull requests, and the identification of Rarely Concurrently Edited Files. To evaluate ConE, we report on its operational deployment on 234 repositories inside Microsoft. ConE assessed 26,000 pull requests and made 775 recommendations about conflicting changes, which were rated as useful in over 70% (554) of the cases. From interviews with 48 users we learned that they believed ConE would save time in conflict resolution and avoiding duplicate work, and that over 90% intend to keep using the service on a daily basis.
SENov 25, 2020
Nudge: Accelerating Overdue Pull Requests Towards CompletionChandra Maddila, Sai Surya Upadrasta, Chetan Bansal et al.
Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests towards completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue, but for which sufficient action is taking place nonetheless. Lastly, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of these notifications as positive. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge's ability to scale to thousands of repositories. Lastly, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account.
SEOct 7, 2020
Questions for Data Scientists in Software Engineering: A ReplicationHennie Huijgens, Ayushi Rastogi, Ernst Mulders et al.
In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast forward to five years, no further studies investigated whether the questions from the software engineers at Microsoft hold for other software companies, including software-intensive companies with different primary focus (to which we refer as software-defined enterprises). Furthermore, it is not evident that the problems identified five years ago are still applicable, given the technological advances in software engineering.
SEOct 3, 2020
AI Lifecycle Models Need To Be Revised. An Exploratory Study in FintechMark Haakman, Luís Cruz, Hennie Huijgens et al.
Tech-leading organizations are embracing the forthcoming artificial intelligence revolution. Intelligent systems are replacing and cooperating with traditional software components. Thus, the same development processes and standards in software engineering ought to be complied in artificial intelligence systems. This study aims to understand the processes by which artificial intelligence-based systems are developed and how state-of-the-art lifecycle models fit the current needs of the industry. We conducted an exploratory case study at ING, a global bank with a strong European base. We interviewed 17 people with different roles and from different departments within the organization. We have found that the following stages have been overlooked by previous lifecycle models: data collection, feasibility study, documentation, model monitoring, and model risk assessment. Our work shows that the real challenges of applying Machine Learning go much beyond sophisticated learning algorithms - more focus is needed on the entire lifecycle. In particular, regardless of the existing development tools for Machine Learning, we observe that they are still not meeting the particularities of this field.
SEMay 28, 2018
The effects of change decomposition on code review -- a controlled experimentMarco di Biase, Magiel Bruntink, Arie van Deursen et al.
Background: Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted at how decomposing change sets into multiple yet internally coherent ones would improve the reviewing process. So far, literature provided no quantitative analysis of this hypothesis. Aims: (1) Quantitatively measure the effects of change decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) Qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes. Method: Controlled experiment using the pull-based development model involving 28 software developers among professionals and graduate students. Results: Change decomposition leads to fewer wrongly reported issues, influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither understanding the change rationale nor the number of found defects. Conclusions: Change decomposition reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering.
SESep 16, 2012
Measuring Spreadsheet Formula UnderstandabilityFelienne Hermans, Martin Pinzger, Arie van Deursen
Spreadsheets are widely used in industry, because they are flexible and easy to use. Sometimes they are even used for business-critical applications. It is however difficult for spreadsheet users to correctly assess the quality of spreadsheets, especially with respect to their understandability. Understandability of spreadsheets is important, since spreadsheets often have a long lifespan, during which they are used by several users. In this paper, we establish a set of spreadsheet understandability metrics. We start by studying related work and interviewing 40 spreadsheet professionals to obtain a set of characteristics that might contribute to understandability problems in spreadsheets. Based on those characteristics we subsequently determine a number of understandability metrics. To evaluate the usefulness of our metrics, we conducted a series of experiments in which professional spreadsheet users performed a number of small maintenance tasks on a set of spreadsheets from the EUSES spreadsheet corpus. We subsequently calculate the correlation between the metrics and the performance of subjects on these tasks. The results clearly indicate that the number of ranges, the nesting depth and the presence of conditional operations in formulas significantly increase the difficulty of understanding a spreadsheet.