Audris Mockus

SE
h-index52
24papers
401citations
Novelty35%
AI Score42

24 Papers

LGAug 31, 2024
Identifying Factors to Help Improve Existing Decomposition-Based PMI Estimation Methods

Anna-Maria Nau, Phillip Ditto, Dawnie Wolfe Steadman et al. · meta-ai

Accurately assessing the postmortem interval (PMI) is an important task in forensic science. Some of the existing techniques use regression models that use a decomposition score to predict the PMI or accumulated degree days (ADD), however, the provided formulas are based on very small samples and the accuracy is low. With the advent of Big Data, much larger samples can be used to improve PMI estimation methods. We, therefore, aim to investigate ways to improve PMI prediction accuracy by (a) using a much larger sample size, (b) employing more advanced linear models, and (c) enhancing models with factors known to affect the human decay process. Specifically, this study involved the curation of a sample of 249 human subjects from a large-scale decomposition dataset, followed by evaluating pre-existing PMI/ADD formulas and fitting increasingly sophisticated models to estimate the PMI/ADD. Results showed that including the total decomposition score (TDS), demographic factors (age, biological sex, and BMI), and weather-related factors (season of discovery, temperature history, and humidity history) increased the accuracy of the PMI/ADD models. Furthermore, the best performing PMI estimation model using the TDS, demographic, and weather-related features as predictors resulted in an adjusted R-squared of 0.34 and an RMSE of 0.95. It had a 7% lower RMSE than a model using only the TDS to predict the PMI and a 48% lower RMSE than the pre-existing PMI formula. The best ADD estimation model, also using the TDS, demographic, and weather-related features as predictors, resulted in an adjusted R-squared of 0.52 and an RMSE of 0.89. It had an 11% lower RMSE than the model using only the TDS to predict the ADD and a 52% lower RMSE than the pre-existing ADD formula. This work demonstrates the need (and way) to incorporate demographic and environmental factors into PMI/ADD estimation models.

CVAug 19, 2024
Towards Automation of Human Stage of Decay Identification: An Artificial Intelligence Approach

Anna-Maria Nau, Phillip Ditto, Dawnie Wolfe Steadman et al. · meta-ai

Determining the stage of decomposition (SOD) is crucial for estimating the postmortem interval and identifying human remains. Currently, labor-intensive manual scoring methods are used for this purpose, but they are subjective and do not scale for the emerging large-scale archival collections of human decomposition photos. This study explores the feasibility of automating two common human decomposition scoring methods proposed by Megyesi and Gelderman using artificial intelligence (AI). We evaluated two popular deep learning models, Inception V3 and Xception, by training them on a large dataset of human decomposition images to classify the SOD for different anatomical regions, including the head, torso, and limbs. Additionally, an interrater study was conducted to assess the reliability of the AI models compared to human forensic examiners for SOD identification. The Xception model achieved the best classification performance, with macro-averaged F1 scores of .878, .881, and .702 for the head, torso, and limbs when predicting Megyesi's SODs, and .872, .875, and .76 for the head, torso, and limbs when predicting Gelderman's SODs. The interrater study results supported AI's ability to determine the SOD at a reliability level comparable to a human expert. This work demonstrates the potential of AI models trained on a large dataset of human decomposition images to automate SOD identification.

72.8SEMay 28
Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

Chris Adams, Arjun Singh Banga, Parveen Bansal et al.

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

SEJan 5, 2025Code
Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets

Mahmoud Jahanshahi, Audris Mockus · meta-ai

A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach leverages the version history of all OSS projects to identify training data samples that have been modified or have undergone changes in at least one OSS project, and pinpoint a subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns. By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.

SEOct 30, 2020Code
World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS data

Yuxing Ma, Tapajit Dey, Chris Bogart et al.

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through. technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.

SEJul 8, 2020Code
Effect of Technical and Social Factors on Pull Request Quality for the NPM Ecosystem

Tapajit Dey, Audris Mockus

Pull request (PR) based development, which is a norm for the social coding platforms, entails the challenge of evaluating the contributions of, often unfamiliar, developers from across the open source ecosystem and, conversely, submitting a contribution to a project with unfamiliar maintainers. Previous studies suggest that the decision of accepting or rejecting a PR may be influenced by a diverging set of technical and social factors, but often focus on relatively few projects, do not consider ecosystem-wide measures, or the possible non-monotonic relationships between the predictors and PR acceptance probability. We aim to shed light on this important decision making process by testing which measures significantly affect the probability of PR acceptance on a significant fraction of a large ecosystem, rank them by their relative importance in predicting PR acceptance, and determine the shape of the functions that map each predictor to PR acceptance. We proposed seven hypotheses regarding which technical and social factors might affect PR acceptance and created 17 measures based on them. Our dataset consisted of 470,925 PRs from 3349 popular NPM packages and 79,128 GitHub users who created those. We tested which of the measures affect PR acceptance and ranked the significant measures by their importance in a predictive model. Our predictive model had and AUC of 0.94, and 15 of the 17 measures were found to matter, including five novel ecosystem-wide measures. Measures describing the number of PRs submitted to a repository and what fraction of those get accepted, and signals about the PR review phase were most significant. We also discovered that only four predictors have a linear influence on the PR acceptance probability while others showed a more complicated response.

SEMay 20, 2020Code
Representation of Developer Expertise in Open Source Software

Tapajit Dey, Andrey Karnauch, Audris Mockus

Background: Accurate representation of developer expertise has always been an important research problem. While a number of studies proposed novel methods of representing expertise within individual projects, these methods are difficult to apply at an ecosystem level. However, with the focus of software development shifting from monolithic to modular, a method of representing developers' expertise in the context of the entire OSS development becomes necessary when, for example, a project tries to find new maintainers and look for developers with relevant skills. Aim: We aim to address this knowledge gap by proposing and constructing the Skill Space where each API, developer, and project is represented and postulate how the topology of this space should reflect what developers know (and projects need). Method: we use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers and, based on that data, employ Doc2Vec embeddings for vector representations of APIs, developers, and projects. We then evaluate if these embeddings reflect the postulated topology of the Skill Space by predicting what new APIs/projects developers use/join, and whether or not their pull requests get accepted. We also check how the developers' representations in the Skill Space align with their self-reported API expertise. Result: Our results suggest that the proposed embeddings in the Skill Space appear to satisfy the postulated topology and we hope that such representations may aid in the construction of signals that increase trust (and efficiency) of open source ecosystems at large and may aid investigations of other phenomena related to developer proficiency and learning.

SEMar 18, 2020Code
A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

Tanner Fry, Tapajit Dey, Andrey Karnauch et al.

The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.

SEAug 29, 2019Code
A Methodology for Analyzing Uptake of Software Technologies Among Developers

Yuxing Ma, Audris Mockus, Beth Milhollin et al.

Motivation: The question of what combination of attributes drives the adoption of a particular software technology is critical to developers. It determines both those technologies that receive wide support from the community and those which may be abandoned, thus rendering developers' investments worthless. Aim and Context: We model software technology adoption by developers and provide insights on specific technology attributes that are associated with better visibility among alternative technologies. Approach: We leverage social contagion theory and statistical modeling to identify, define, and test empirically measures that are likely to affect software adoption. More specifically, we leverage a large collection of open source version control repositories to construct a software dependency chain for a specific set of R language source-code files. We formulate logistic regression models, to investigate the combination of technological attributes that drive adoption among competing data frame implementations in the R language: tidy and data.table. We quantify key project attributes that might affect adoption and also characteristics of developers making the selection. Results: We find that a quick response to raised issues, a larger number of overall deployments, and a larger number of high-quality StackExchange questions are associated with higher adoption. Decision makers tend to adopt the technology that is closer to them in the technical dependency network and in author collaborations networks while meeting their performance needs. Future work: We hope that our methodology encompassing social contagion that captures both rational and irrational preferences and the elucidation of key measures from large collections of version control data provides a general path toward increasing visibility, driving better informed decisions, and producing more sustainable and widely adopted software

SEJul 15, 2019Code
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

Tapajit Dey, Yuxing Ma, Audris Mockus

Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators is obtained and analyzed and dependencies on NPM packages are identified. Fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives a AUC-ROC value of 0.68. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggests the need to increase visibility further upstream.

SEJan 10, 2019Code
ALFAA: Active Learning Fingerprint Based Anti-Aliasing for Correcting Developer Identity Errors in Version Control Data

Sadika Amreen, Audris Mockus, Chris Bogart et al.

Graphs of developer networks are important for software engineering research and practice. For these graphs to realistically represent the networks, accurate developer identities are imperative. We aim to identify developer identity errors from open source software repositories in VCS, investigate the nature of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with 3 behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models to predict the identities for the remaining author strings in OpenStack. We compare these predictions with a commercial effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find commits done from different environments, misspellings, organizational IDs, default values, and anonymous IDs to be the major sources of errors. We also find supervised learning methods to reduce errors by several times in comparison to existing methods and the active learning approach to be an effective way to create validated datasets and that correction of developer identity has a large impact on the inference of the social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that depend upon graphs of developers or other social networks.

CVFeb 24, 2022
SLRNet: Semi-Supervised Semantic Segmentation Via Label Reuse for Human Decomposition Images

Sara Mousavi, Zhenning Yang, Kelley Cross et al.

Semantic segmentation is a challenging computer vision task demanding a significant amount of pixel-level annotated data. Producing such data is a time-consuming and costly process, especially for domains with a scarcity of experts, such as medicine or forensic anthropology. While numerous semi-supervised approaches have been developed to make the most from the limited labeled data and ample amount of unlabeled data, domain-specific real-world datasets often have characteristics that both reduce the effectiveness of off-the-shelf state-of-the-art methods and also provide opportunities to create new methods that exploit these characteristics. We propose and evaluate a semi-supervised method that reuses available labels for unlabeled images of a dataset by exploiting existing similarities, while dynamically weighting the impact of these reused labels in the training process. We evaluate our method on a large dataset of human decomposition images and find that our method, while conceptually simple, outperforms state-of-the-art consistency and pseudo-labeling-based methods for the segmentation of this dataset. This paper includes graphic content of human decomposition.

CVMay 20, 2021
Pseudo Pixel-level Labeling for Images with Evolving Content

Sara Mousavi, Zhenning Yang, Kelley Cross et al.

Annotating images for semantic segmentation requires intense manual labor and is a time-consuming and expensive task especially for domains with a scarcity of experts, such as Forensic Anthropology. We leverage the evolving nature of images depicting the decay process in human decomposition data to design a simple yet effective pseudo-pixel-level label generation technique to reduce the amount of effort for manual annotation of such images. We first identify sequences of images with a minimum variation that are most suitable to share the same or similar annotation using an unsupervised approach. Given one user-annotated image in each sequence, we propagate the annotation to the remaining images in the sequence by merging it with annotations produced by a state-of-the-art CAM-based pseudo label generation technique. To evaluate the quality of our pseudo-pixel-level labels, we train two semantic segmentation models with VGG and ResNet backbones on images labeled using our pseudo labeling method and those of a state-of-the-art method. The results indicate that using our pseudo-labels instead of those generated using the state-of-the-art method in the training process improves the mean-IoU and the frequency-weighted-IoU of the VGG and ResNet-based semantic segmentation models by 3.36%, 2.58%, 10.39%, and 12.91% respectively.

SEMar 1, 2021
The Secret Life of Hackathon Code

Ahmed Imam, Tapajit Dey, Alexander Nolte et al.

Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the code brought to or created during a hackathon. Aim: We aim to understand the evolution of hackathon-related code, specifically, how much hackathon teams rely on pre-existing code or how much new code they develop during a hackathon. Moreover, we aim to understand if and where that code gets reused, and what factors affect reuse. Method: We collected information about 22,183 hackathon projects from DEVPOST -- a hackathon database -- and obtained related code (blobs), authors, and project characteristics from the World of Code. We investigated if code blobs in hackathon projects were created before, during, or after an event by identifying the original blob creation date and author, and also checked if the original author was a hackathon project member. We tracked code reuse by first identifying all commits containing blobs created during an event before determining all projects that contain those commits. Result: While only approximately 9.14% of the code blobs are created during hackathons, this amount is still significant considering the time and member constraints of such events. Approximately a third of these code blobs get reused in other projects. The number of associated technologies and the number of participants in a project increase reuse probability. Conclusion: Our study demonstrates to what extent pre-existing code is used and new code is created during a hackathon and how much of it is reused elsewhere afterwards. Our findings help to better understand code reuse as a phenomenon and the role of hackathons in this context and can serve as a starting point for further studies in this area.

SEAug 8, 2020
More Effective Software Repository Mining

Adam Tutko, Austin Henley, Audris Mockus

Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such ``convenience samples'' generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain the generalizability of their findings by introducing an interface that addresses the issues with obtaining representative samples. Results: To do that we explore how to exploit the World of Code system to make software repository sampling and analysis much more accessible. Specifically, we present a resource for Mining Software Repository researchers that is intended to simplify data sampling and retrieval workflow and, through that, increase the validity and completeness of data. Conclusions: This system has the potential to provide researchers a resource that greatly eases the difficulty of data retrieval and addresses many of the currently standing issues with data sampling.

SEMay 19, 2020
Do Code Review Measures Explain the Incidence of Post-Release Defects?

Andrey Krutauz, Tapajit Dey, Peter C. Rigby et al.

Aim: In contrast to studies of defects found during code review, we aim to clarify whether code reviews measures can explain the prevalence of post-release defects. Method: We replicate a study by McIntoshet. al that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology on a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models. Context: As in the original study, we use the same measures authors obtained for Qt project in the original study. We mine data from version control and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated the study. Results: Both the data from the original study and the Chrome data showed high instability of the influence of code review measures on defects with the results being highly sensitive to variable selection procedure. Models without code review predictors had as good or better fit than those with review predictors. Replication, however, confirms with the bulk of prior work showing that prior defects, module size, and authorship have the strongest relationship to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly and showed indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects which in turn increase the number of post-release defects.

SEMar 17, 2020
An Exploratory Study of Bot Commits

Tapajit Dey, Bogdan Vasilescu, Audris Mockus

Background: Bots help automate many of the tasks performed by software developers and are widely used to commit code in various social coding platforms. At present, it is not clear what types of activities these bots perform and understanding it may help design better bots, and find application areas which might benefit from bot adoption. Aim: We aim to categorize the Bot Commits by the type of change (files added, deleted, or modified), find the more commonly changed file types, and identify the groups of file types that tend to get updated together. Method: 12,326,137 commits made by 461 popular bots (that made at least 1000 commits) were examined to identify the frequency and the type of files added/ deleted/ modified by the commits, and association rule mining was used to identify the types of files modified together. Result: Majority of the bot commits modify an existing file, a few of them add new files, while deletion of a file is very rare. Commits involving more than one type of operation are even rarer. Files containing data, configuration, and documentation are most frequently updated, while HTML is the most common type in terms of the number of files added, deleted, and modified. Files of the type "Markdown", "Ignore List", "YAML", "JSON" were the types that are updated together with other types of files most frequently. Conclusion: We observe that majority of bot commits involve single file modifications, and bots primarily work with data, configuration, and documentation files. A better understanding if this is a limitation of the bots and, if overcome, would lead to different kinds of bots remains an open question.

LGMar 9, 2020
Collaborative Learning of Semi-Supervised Clustering and Classification for Labeling Uncurated Data

Sara Mousavi, Dylan Lee, Tatianna Griffin et al.

Domain-specific image collections present potential value in various areas of science and business but are often not curated nor have any way to readily extract relevant content. To employ contemporary supervised image analysis methods on such image data, they must first be cleaned and organized, and then manually labeled for the nomenclature employed in the specific domain, which is a time consuming and expensive endeavor. To address this issue, we designed and implemented the Plud system. Plud provides an iterative semi-supervised workflow to minimize the effort spent by an expert and handles realistic large collections of images. We believe it can support labeling datasets regardless of their size and type. Plud is an iterative sequence of unsupervised clustering, human assistance, and supervised classification. With each iteration 1) the labeled dataset grows, 2) the generality of the classification method and its accuracy increases, and 3) manual effort is reduced. We evaluated the effectiveness of our system, by applying it on over a million images documenting human decomposition. In our experiment comparing manual labeling with labeling conducted with the support of Plud, we found that it reduces the time needed to label data and produces highly accurate models for this new domain.

SEMar 2, 2020
Detecting and Characterizing Bots that Commit Code

Tapajit Dey, Sara Mousavi, Eduardo Ponce et al.

Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the ommits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.

SEMar 2, 2020
Which Pull Requests Get Accepted and Why? A study of popular NPM Packages

Tapajit Dey, Audris Mockus

Background: Pull Request (PR) Integrators often face challenges in terms of multiple concurrent PRs, so the ability to gauge which of the PRs will get accepted can help them balance their workload. PR creators would benefit from knowing if certain characteristics of their PRs may increase the chances of acceptance. Aim: We modeled the probability that a PR will be accepted within a month after creation using a Random Forest model utilizing 50 predictors representing properties of the author, PR, and the project to which PR is submitted. Method: 483,988 PRs from 4218 popular NPM packages were analysed and we selected a subset of 14 predictors sufficient for a tuned Random Forest model to reach high accuracy. Result: An AUC-ROC value of 0.95 was achieved predicting PR acceptance. The model excluding PR properties that change after submission gave an AUC-ROC value of 0.89. We tested the utility of our model in practical scenarios by training it with historical data for the NPM package \textit{bootstrap} and predicting if the PRs submitted in future will be accepted. This gave us an AUC-ROC value of 0.94 with all 14 predictors, and 0.77 excluding PR properties that change after its creation. Conclusion: PR integrators can use our model for a highly accurate assessment of the quality of the open PRs and PR creators may benefit from the model by understanding which characteristics of their PRs may be undesirable from the integrators' perspective. The model can be implemented as a tool, which we plan to do as a future work.

SEFeb 23, 2020
Deriving a Usage-Independent Software Quality Metric

Tapajit Dey, Audris Mockus

Context:The extent of post-release use of software affects the number of faults, thus biasing quality metrics and adversely affecting associated decisions. The proprietary nature of usage data limited deeper exploration of this subject in the past. Objective: To determine how software faults and software use are related and how an accurate quality measure can be designed. Method: New users, usage intensity, usage frequency, exceptions, and release date and duration measured for complex proprietary mobile applications for Android and iOS. Utilized Bayesian Network and Random Forest models to explain the interrelationships and to derive the usage independent release quality measure. Investigated the interrelationship among various code complexity measures, usage (downloads), and number of issues for 520 NPM packages and derived a usage-independent quality measure from these analyses, applied it on 4430 popular NPM packages to construct timelines for comparing the perceived quality (issues) and our derived measure of quality for these packages.Results: We found the number of new users to be the primary factor determining the number of exceptions, and found no direct link between the intensity and frequency of software usage and software faults. Release quality expressed as crashes per user was independent of other usage-related predictors, thus serving as a usage independent measure of software quality. Usage also affected quality in NPM, where downloads were strongly associated with numbers of issues, even after taking the other code complexity measures into consideration. Conclusions: We expect our result and our proposed quality measure will help gauge release quality of a software more accurately and inspire further research in this area.

SEFeb 6, 2020
A Dataset for GitHub Repository Deduplication: Extended Description

Diomidis Spinellis, Zoe Kotti, Audris Mockus

GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.

CVDec 29, 2019
An Analytical Workflow for Clustering Forensic Images

Sara Mousavi, Dylan Lee, Tatianna Griffin et al.

Large collections of images, if curated, drastically contribute to the quality of research in many domains. Unsupervised clustering is an intuitive, yet effective step towards curating such datasets. In this work, we present a workflow for unsupervisedly clustering a large collection of forensic images. The workflow utilizes classic clustering on deep feature representation of the images in addition to domain-related data to group them together. Our manual evaluation shows a purity of 89\% for the resulted clusters.

CVFeb 28, 2019
Machine-assisted annotation of forensic imagery

Sara Mousavi, Ramin Nabati, Megan Kleeschulte et al.

Image collections, if critical aspects of image content are exposed, can spur research and practical applications in many domains. Supervised machine learning may be the only feasible way to annotate very large collections, but leading approaches rely on large samples of completely and accurately annotated images. In the case of a large forensic collection, we are aiming to annotate, neither the complete annotation nor the large training samples can be feasibly produced. We, therefore, investigate ways to assist manual annotation efforts done by forensic experts. We present a method that can propose both images and areas within an image likely to contain desired classes. Evaluation of the method with human annotators showed highly accurate classification that was strongly helped by transfer learning. The segmentation precision (mAP) was improved by adding a separate class capturing background, but that did not affect the recall (mAR). Further work is needed to both increase the accuracy of segmentation and enhances prediction with additional covariates affecting decomposition. We hope this effort to be of help in other domains that require weak segmentation and have limited availability of qualified annotators.