39.9SEMay 24
How do Agents Refactor: An Empirical StudyLukas Ottenhof, Daniel Penner, Abram Hindle et al.
Software development agents such as Claude Code, GitHub Copilot, Cursor Agent, Devin, and OpenAI Codex are being increasingly integrated into developer workflows. While prior work has evaluated agent capabilities for code completion and task automation, there is little work investigating how these agents perform Java refactoring in practice, the types of changes they make, and their impact on code quality. In this study, we present the first analysis of agentic refactoring pull requests in Java, comparing them to developer refactorings across 86 projects per group. Using RefactoringMiner and DesigniteJava 3.0, we identify refactoring types and detect code smells before and after refactoring commits. Our results show that agent refactorings are dominated by annotation changes (the 5 most common refactoring types done by agents are annotation related), in contrast to the diverse structural improvements typical of developers. Despite these differences in refactoring types, we find Cursor to be the only model to show a statistically significant increase in refactoring smells.
SPOct 6, 2022
ECG for high-throughput screening of multiple diseases: Proof-of-concept using multi-diagnosis deep learning from population-based datasetsWeijie Sun, Sunil Vasu Kalmady, Amir Salimi et al.
Electrocardiogram (ECG) abnormalities are linked to cardiovascular diseases, but may also occur in other non-cardiovascular conditions such as mental, neurological, metabolic and infectious conditions. However, most of the recent success of deep learning (DL) based diagnostic predictions in selected patient cohorts have been limited to a small set of cardiac diseases. In this study, we use a population-based dataset of >250,000 patients with >1000 medical conditions and >2 million ECGs to identify a wide range of diseases that could be accurately diagnosed from the patient's first in-hospital ECG. Our DL models uncovered 128 diseases and 68 disease categories with strong discriminative performance.
SPNov 14, 2022
Improving ECG-based COVID-19 diagnosis and mortality predictions using pre-pandemic medical records at population-scaleWeijie Sun, Sunil Vasu Kalmady, Nariman Sepehrvand et al.
Pandemic outbreaks such as COVID-19 occur unexpectedly, and need immediate action due to their potential devastating consequences on global health. Point-of-care routine assessments such as electrocardiogram (ECG), can be used to develop prediction models for identifying individuals at risk. However, there is often too little clinically-annotated medical data, especially in early phases of a pandemic, to develop accurate prediction models. In such situations, historical pre-pandemic health records can be utilized to estimate a preliminary model, which can then be fine-tuned based on limited available pandemic data. This study shows this approach -- pre-train deep learning models with pre-pandemic data -- can work effectively, by demonstrating substantial performance improvement over three different COVID-19 related diagnostic and prognostic prediction tasks. Similar transfer learning strategies can be useful for developing timely artificial intelligence solutions in future pandemic outbreaks.
54.8SEApr 22
Mining Type Constructs Using Patterns in AI-Generated CodeImgyeong Lee, Tayyib Ul Hassan, Abram Hindle
Artificial Intelligence (AI) increasingly automates various parts of the software development tasks. Although AI has enhanced the productivity of development tasks, it remains unstudied whether AI essentially outperforms humans in type-related programming tasks, such as employing type constructs properly for type safety, during its tasks. Moreover, there is no systematic study that evaluates whether AI agents overuse or misuse the type constructs under the complicated type systems to the same extent as humans. In this study, we present the first empirical analysis to answer these questions in the domain of TypeScript projects. Our findings show that, in contrast to humans, AI agents are 9x more prone to use the 'any' keyword. In addition, we observed that AI agents use advanced type constructs, including those that ignore type checks, more often compared to humans. Surprisingly, even with all these issues, Agentic pull requests (PRs) have 1.8x higher acceptance rates compared to humans for TypeScript. We encourage software developers to carefully confirm the type safety of their codebases whenever they coordinate with AI agents in the development process.
77.2CVMar 24
How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay VideosWentao Lu, Alexander Senchenko, Alan Sayle et al.
Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across \textbf{100 videos} totaling \textbf{41 hours} and \textbf{19,738 keyframes}, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
SPNov 2, 2023
Exploring Best Practices for ECG Pre-Processing in Machine LearningAmir Salimi, Sunil Vasu Kalmady, Abram Hindle et al.
In this work we search for best practices in pre-processing of Electrocardiogram (ECG) signals in order to train better classifiers for the diagnosis of heart conditions. State of the art machine learning algorithms have achieved remarkable results in classification of some heart conditions using ECG data, yet there appears to be no consensus on pre-processing best practices. Is this lack of consensus due to different conditions and architectures requiring different processing steps for optimal performance? Is it possible that state of the art deep-learning models have rendered pre-processing unnecessary? In this work we apply down-sampling, normalization, and filtering functions to 3 different multi-label ECG datasets and measure their effects on 3 different high-performing time-series classifiers. We find that sampling rates as low as 50Hz can yield comparable results to the commonly used 500Hz. This is significant as smaller sampling rates will result in smaller datasets and models, which require less time and resources to train. Additionally, despite their common usage, we found min-max normalization to be slightly detrimental overall, and band-passing to make no measurable difference. We found the blind approach to pre-processing of ECGs for multi-label classification to be ineffective, with the exception of sample rate reduction which reliably reduces computational resources, but does not increase accuracy.
SPApr 26, 2024Code
Federated Learning and Differential Privacy Techniques on Multi-hospital Population-scale Electrocardiogram DataVikhyat Agrawal, Sunil Vasu Kalmady, Venkataseetharam Manoj Malipeddi et al.
This research paper explores ways to apply Federated Learning (FL) and Differential Privacy (DP) techniques to population-scale Electrocardiogram (ECG) data. The study learns a multi-label ECG classification model using FL and DP based on 1,565,849 ECG tracings from 7 hospitals in Alberta, Canada. The FL approach allowed collaborative model training without sharing raw data between hospitals while building robust ECG classification models for diagnosing various cardiac conditions. These accurate ECG classification models can facilitate the diagnoses while preserving patient confidentiality using FL and DP techniques. Our results show that the performance achieved using our implementation of the FL approach is comparable to that of the pooled approach, where the model is trained over the aggregating data from all hospitals. Furthermore, our findings suggest that hospitals with limited ECGs for training can benefit from adopting the FL model compared to single-site training. In addition, this study showcases the trade-off between model performance and data privacy by employing DP during model training. Our code is available at https://github.com/vikhyatt/Hospital-FL-DP.
SEDec 15, 2020Code
A Quantitative Study of Security Bug Fixes of GitHub RepositoriesDaito Nakano, Mingyang Yin, Ryosuke Sato et al.
Software is prone to bugs and failures. Security bugs are those that expose or share privileged information and access in violation of the software's requirements. Given the seriousness of security bugs, there are centralized mechanisms for supporting and tracking these bugs across multiple products, one such mechanism is the Common Vulnerabilities and Exposures (CVE) ID description. When a bug gets a CVE, it is referenced by its CVE ID. Thus we explore thousands of Free/Libre Open Source Software (FLOSS) projects, on Github, to determine if developers reference or discuss CVEs in their code, commits, and issues. CVEs will often refer to 3rd party software dependencies of a project and thus the bug will not be in the actual product itself. We study how many of these references are intentional CVE references, and how many are relevant bugs within the projects themselves. We investigate how the bugs that reference CVEs are fixed and how long it takes to fix these bugs. The results of our manual classification for 250 bug reports show that 88 (35%), 32 (13%), and 130 (52%) are classified into "Version Update", "Fixing Code", and "Discussion". To understand how long it takes to fix those bugs, we compare two periods, Reporting Period, a period between the disclosure date of vulnerability information in CVE repositories and the creation date of the bug report in a project, and Fixing Period, a period between the creation date of the bug report and the fixing date of the bug report. We find that 44% of bug reports that are classified into "Version Update" or "Fixing Code" have longer Reporting Period than Fixing Period. This suggests that those who submit CVEs should notify affected projects more directly.
SEOct 28, 2013Code
Mining the Temporal Evolution of the Android Bug Reporting Community via Sliding WindowsFeng Jiang, Jiemin Wang, Abram Hindle et al.
The open source development community consists of both paid and volunteer developers as well as new and experienced users. Previous work has applied social network analysis (SNA) to open source communities and has demonstrated value in expertise discovery and triaging. One problem with applying SNA directly to the data of the entire project lifetime is that the impact of local activities will be drowned out. In this paper we provide a method for aggregating, analyzing, and visualizing local (small time periods) interactions of bug reporting participants by using the SNA to measure the betweeness centrality of these participants. In particular we mined the Android bug repository by producing social networks from overlapping 30-day windows of bug reports, each sliding over by day. In this paper we define three patterns of participant behaviour based on their local centrality. We propose a method of analyzing the centrality of bug report participants both locally and globally, then we conduct a thorough case study of the bug reporter's activity within the Android bug repository. Furthermore, we validate the conclusions of our method by mining the Android version control system and inspecting the Android release history. We found that windowed SNA analysis elicited local behaviour that were invisible during global analysis.
CLSep 22, 2025
Dorabella Cipher as Musical InspirationBradley Hauer, Colin Choi, Abram Hindle et al.
The Dorabella cipher is an encrypted note written by English composer Edward Elgar, which has defied decipherment attempts for more than a century. While most proposed solutions are English texts, we investigate the hypothesis that Dorabella represents enciphered music. We weigh the evidence for and against the hypothesis, devise a simplified music notation, and attempt to reconstruct a melody from the cipher. Our tools are n-gram models of music which we validate on existing music corpora enciphered using monoalphabetic substitution. By applying our methods to Dorabella, we produce a decipherment with musical qualities, which is then transformed via artful composition into a listenable melody. Far from arguing that the end result represents the only true solution, we instead frame the process of decipherment as part of the composition process.
SEJun 21, 2021
An empirical evaluation of the usefulness of Tree Kernels for Commit-time Defect Detection in large software systemsHareem Sahar, Yuxin Liu, Abram Hindle et al.
Defect detection at commit check-in time prevents the introduction of defects into software systems. Current defect detection approaches rely on metric-based models which are not very accurate and whose results are not directly useful for developers. We propose a method to detect bug-inducing commits by comparing the incoming changes with all past commits in the project, considering both those that introduced defects and those that did not. Our method considers individual changes in the commit separately, at the method-level granularity. Doing so helps developers as they are informed of specific methods that need further attention instead of being told that the entire commit is problematic. Our approach represents source code as abstract syntax trees and uses tree kernels to estimate the similarity of the code with previous commits. We experiment with subtree kernels (STK), subset tree kernels (SSTK), or partial tree kernels (PTK). An incoming change is then classified using a K-NN classifier on the past changes. We evaluate our approach on the BigCloneBench benchmark and on the Technical Debt dataset, using the NiCad clone detector as the baseline. Our experiments with the BigCloneBench benchmark show that the tree kernel approach can detect clones with a comparable MAP to that of NiCad. Also, on defect detection with the Technical Debt dataset, tree kernels are least as effective as NiCad with MRR, F-score, and Accuracy of 0.87, 0.80, and 0.82 respectively.
SEMar 23, 2021
Revisiting Dockerfiles in Open Source Software Over TimeKalvin Eng, Abram Hindle
Docker is becoming ubiquitous with containerization for developing and deploying applications. Previous studies have analyzed Dockerfiles that are used to create container images in order to better understand how to improve Docker tooling. These studies obtain Dockerfiles using either Docker Hub or Github. In this paper, we revisit the findings of previous studies using the largest set of Dockerfiles known to date with over 9.4 million unique Dockerfiles found in the World of Code infrastructure spanning from 2013-2020. We contribute a historical view of the Dockerfile format by analyzing the Docker engine changelogs and use the history to enhance our analysis of Dockerfiles. We also reconfirm previous findings of a downward trend in using OS images and an upward trend of using language images. As well, we reconfirm that Dockerfile smell counts are slightly decreasing meaning that Dockerfile authors are likely getting better at following best practices. Based on these findings, it indicates that previous analyses from prior works have been correct in many of their findings and their suggestions to build better tools for Docker image creation are further substantiated.
SPOct 21, 2020
Multilabel 12-Lead Electrocardiogram Classification Using Gradient Boosting Tree EnsembleAlexander William Wong, Weijie Sun, Sunil Vasu Kalmady et al.
The 12-lead electrocardiogram (ECG) is a commonly used tool for detecting cardiac abnormalities such as atrial fibrillation, blocks, and irregular complexes. For the PhysioNet/CinC 2020 Challenge, we built an algorithm using gradient boosted tree ensembles fitted on morphology and signal processing features to classify ECG diagnosis. For each lead, we derive features from heart rate variability, PQRST template shape, and the full signal waveform. We join the features of all 12 leads to fit an ensemble of gradient boosting decision trees to predict probabilities of ECG instances belonging to each class. We train a phase one set of feature importance determining models to isolate the top 1,000 most important features to use in our phase two diagnosis prediction models. We use repeated random sub-sampling by splitting our dataset of 43,101 records into 100 independent runs of 85:15 training/validation splits for our internal evaluation results. Our methodology generates us an official phase validation set score of 0.476 and test set score of -0.080 under the team name, CVC, placing us 36 out of 41 in the rankings.
SENov 14, 2019
On the Time-Based Conclusion Stability of Cross-Project Defect Prediction ModelsAbdul Ali Bangash, Hareem Sahar, Abram Hindle et al.
Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that in the area of Software Analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of defect prediction truly exhibit stability throughout time or not. Our investigation applies a time-aware evaluation approach where models are trained only on the past, and evaluations are executed only on the future. Through this time-aware evaluation, we show that depending on which time period we evaluate defect predictors, their performance, in terms of F-Score, the area under the curve (AUC), and Mathews Correlation Coefficient (MCC), varies and their results are not consistent. The next release of a product, which is significantly different from its prior release, may drastically change defect prediction performance. Therefore, without knowing about the conclusion stability, empirical software engineering researchers should limit their claims of performance within the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next upcoming release of a product under analysis.
SEJul 17, 2019
Syntax and Stack Overflow: A methodology for extracting a corpus of syntax errors and fixesAlexander William Wong, Amir Salimi, Shaiful Chowdhury et al.
One problem when studying how to find and fix syntax errors is how to get natural and representative examples of syntax errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, thus they are ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent data set. We further analyzed our code by executing the corrections in a Python interpreter. We applied our methodology to produce a public data set of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally we share our dataset openly so that future researchers can re-use and extend our syntax errors and fixes.
SEJul 10, 2019
Executability of Python Snippets in Stack OverflowMd Monir Hossain, Nima Mahmoudi, Changyuan Lin et al.
Online resources today contain an abundant amount of code snippets for documentation, collaboration, learning, and problem-solving purposes. Their executability in a "plug and play" manner enables us to confirm their quality and use them directly in projects. But, in practice that is often not the case due to several requirements violations or incompleteness. However, it is a difficult task to investigate the executability on a large scale due to different possible errors during the execution. We have developed a scalable framework to investigate this for SOTorrent Python snippets. We found that with minor adjustments, 27.92% of snippets are executable. The executability has not changed significantly over time. The code snippets referenced in GitHub are more likely to be directly executable. But executability does not affect the chances of the answer to be selected as the accepted answer significantly. These properties help us understand and improve the interaction of users with online resources that include code snippets.
IRApr 15, 2019
Tracing Forum Posts to MOOC Content using Topic AnalysisAlexander William Wong, Ken Wong, Abram Hindle
Massive Open Online Courses are educational programs that are open and accessible to a large number of people through the internet. To facilitate learning, MOOC discussion forums exist where students and instructors communicate questions, answers, and thoughts related to the course. The primary objective of this paper is to investigate tracing discussion forum posts back to course lecture videos and readings using topic analysis. We utilize both unsupervised and supervised variants of Latent Dirichlet Allocation (LDA) to extract topics from course material and classify forum posts. We validate our approach on posts bootstrapped from five Coursera courses and determine that topic models can be used to map student discussion posts back to the underlying course lecture or reading. Labeled LDA outperforms unsupervised Hierarchical Dirichlet Process LDA and base LDA for our traceability task. This research is useful as it provides an automated approach for clustering student discussions by course material, enabling instructors to quickly evaluate student misunderstanding of content and clarify materials accordingly.
SEJul 31, 2018
Sourcerer's Apprentice and the study of code snippet migrationStephen Romansky, Cheng Chen, Baljeet Malhotra et al.
On the worldwide web, not only are webpages connected but source code is too. Software development is becoming more accessible to everyone and the licensing for software remains complicated. We need to know if software licenses are being maintained properly throughout their reuse and evolution. This motivated the development of the Sourcerer's Apprentice, a webservice that helps track clone relicensing, because software typically employ software licenses to describe how their software may be used and adapted. But most developers do not have the legal expertise to sort out license conflicts. In this paper we put the Apprentice to work on empirical studies that demonstrate there is much sharing between StackOverflow code and Python modules and Python documentation that violates the licensing of the original Python modules and documentation: software snippets shared through StackOverflow are often being relicensed improperly to CC-BY-SA 3.0 without maintaining the appropriate attribution. We show that many snippets on StackOverflow are inappropriately relicensed by StackOverflow users, jeopardizing the status of the software built by companies and developers who reuse StackOverflow snippets.