Kathryn T. Stolee

SE
7papers
175citations
Novelty34%
AI Score39

7 Papers

52.1SEMar 16Code
Test Code Review in the Era of GitHub Actions: A Replication Study

Hui Sun, Yinan Wu, Wesley K. G. Assunção et al.

Test code is indispensable in software development, ensuring the correctness of production code and supporting maintainability. Nonetheless, errors or omissions in the test code can conceal production defects. While code review is widely adopted to assess code quality and correctness, little research has examined how test code is reviewed. Spadini et al.'s research on Gerrit (a pre-commit review model) found that test code receives significantly less discussion than production code. However, the most popular review model is currently based on pull requests (PRs), in which contributors propose changes for discussion and approval, a more negotiable and flexible model compared to Gerrit. Furthermore, GitHub Actions (GHA) has become widely used to automate pre-checks and testing, potentially impacting review practices. This leads us to explore whether Spadini et al.'s findings still hold for the PR model in the era of GHA? Our work replicates and extends their work. We focus on GitHub PRs and analyze six open-source projects. We investigate the impact of the PR model and GHA on test code review. Our results show that GitHub's PR model fosters more balanced discussions between test and production files than Gerrit, albeit with lower overall comment density. However, despite cross-project heterogeneity, GHA adoption triggered a sharp pivot toward production code. Post-GHA, for PRs involving tests, both review probability and comment density reached a median of zero. These findings reveal how evolving continuous integration pipelines can marginalize test code review. The observed decline in test-centric discussion under GHA warrants concern regarding long-term software quality. Our work also presents recommendations for stakeholders involved in the software development life cycle.

SEJun 16, 2021Code
Cross-Language Code Search using Static and Dynamic Analyses

George Mathew, Kathryn T. Stolee

As code search permeates most activities in software development,code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches such as the comparison of tokens and abstract syntax trees (AST) to approximate dynamic behavior, leading to low precision. Most tools do not support cross-language code-to-code search, and those that do, rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets of 43,146Java and Python files and 55,499 Java files and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective compared to single or weighted measures; and 2) COSAL has better precision and recall compared to state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical,multi-language code-to-code search.

LGOct 25, 2021
Fair Enough: Searching for Sufficient Measures of Fairness

Suvodeep Majumder, Joymallya Chakraborty, Gina R. Bai et al.

Testing machine learning software for ethical bias has become a pressing current concern. In response, recent research has proposed a plethora of new fairness metrics, for example, the dozens of fairness metrics in the IBM AIF360 toolkit. This raises the question: How can any fairness tool satisfy such a diverse range of goals? While we cannot completely simplify the task of fairness testing, we can certainly reduce the problem. This paper shows that many of those fairness metrics effectively measure the same thing. Based on experiments using seven real-world datasets, we find that (a) 26 classification metrics can be clustered into seven groups, and (b) four dataset metrics can be clustered into three groups. Further, each reduced set may actually predict different things. Hence, it is no longer necessary (or even possible) to satisfy all fairness metrics. In summary, to simplify the fairness testing problem, we recommend the following steps: (1)~determine what type of fairness is desirable (and we offer a handful of such types); then (2) lookup those types in our clusters; then (3) just test for one item per cluster.

SEApr 19, 2021
Demystifying Regular Expression Bugs: A comprehensive study on regular expression bug causes, fixes, and testing

Peipei Wang, Chris Brown, Jamie A. Jennings et al.

Regular expressions cause string-related bugs and open security vulnerabilities for DOS attacks. However, beyond ReDoS (Regular expression Denial of Service), little is known about the extent to which regular expression issues affect software development and how these issues are addressed in practice. We conduct an empirical study of 356 merged regex-related pull request bugs from Apache, Mozilla, Facebook, and Google GitHub repositories. We identify and classify the nature of the regular expression problems, the fixes, and the related changes in the test code. The most important findings in this paper are as follows: 1) incorrect regular expression behavior is the dominant root cause of regular expression bugs (165/356, 46.3%). The remaining root causes are incorrect API usage (9.3%) and other code issues that require regular expression changes in the fix (29.5%), 2) fixing regular expression bugs is nontrivial as it takes more time and more lines of code to fix them compared to the general pull requests, 3) most (51%) of the regex-related pull requests do not contain test code changes. Certain regex bug types (e.g., compile error, performance issues, regex representation) are less likely to include test code changes than others, and 4) the dominant type of test code changes in regex-related pull requests is test case addition (75%). The results of this study contribute to a broader understanding of the practical problems faced by developers when using, fixing, and testing regular expressions.

SEFeb 10, 2021
SQLRepair: Identifying and Repairing Mistakes in Student-Authored SQL Queries

Kai Presler-Marshall, Sarah Heckman, Kathryn T. Stolee

Computer science educators seek to understand the types of mistakes that students make when learning a new (programming) language so that they can help students avoid those mistakes in the future. While educators know what mistakes students regularly make in languages such as C and Python, students struggle with SQL and regularly make mistakes when working with it. We present an analysis of mistakes that students made when first working with SQL, classify the types of errors introduced, and provide suggestions on how to avoid them going forward. In addition, we present an automated tool, SQLRepair, that is capable of repairing errors introduced by undergraduate programmers when writing SQL queries. Our results show that students find repairs produced by our tool comparable in understandability to queries written by themselves or by other students, suggesting that SQL repair tools may be useful in an educational context. We also provide to the community a benchmark of SQL queries written by the students in our study that we used for evaluation of SQLRepair.

SEMar 22, 2018
Evaluating How Developers Use General-Purpose Web-Search for Code Retrieval

Md Masudur Rahman, Jed Barson, Sydney Paul et al.

Search is an integral part of a software development process. Developers often use search engines to look for information during development, including reusable code snippets, API understanding, and reference examples. Developers tend to prefer general-purpose search engines like Google, which are often not optimized for code related documents and use search strategies and ranking techniques that are more optimized for generic, non-code related information. In this paper, we explore whether a general purpose search engine like Google is an optimal choice for code-related searches. In particular, we investigate whether the performance of searching with Google varies for code vs. non-code related searches. To analyze this, we collect search logs from 310 developers that contains nearly 150,000 search queries from Google and the associated result clicks. To differentiate between code-related searches and non-code related searches, we build a model which identifies the code intent of queries. Leveraging this model, we build an automatic classifier that detects a code and non-code related query. We confirm the effectiveness of the classifier on manually annotated queries where the classifier achieves a precision of 87%, a recall of 86%, and an F1-score of 87%. We apply this classifier to automatically annotate all the queries in the dataset. Analyzing this dataset, we observe that code related searching often requires more effort (e.g., time, result clicks, and query modifications) than general non-code search, which indicates code search performance with a general search engine is less effective.

SEFeb 27, 2017
Replicating and Scaling up Qualitative Analysis using Crowdsourcing: A Github-based Case Study

Di Chen, Kathryn T. Stolee, Tim Menzies

Due to the difficulties in replicating and scaling up qualitative studies, such studies are rarely verified. Accordingly, in this paper, we leverage the advantages of crowdsourcing (low costs, fast speed, scalable workforce) to replicate and scale-up one state-of-the-art qualitative study. That qualitative study explored 20 GitHub pull requests to learn factors that influence the fate of pull requests with respect to approval and merging. As a secondary study, using crowdsourcing at a cost of $200, we studied 250 pull requests from 142 GitHub projects. The prior qualitative findings are mapped into questions for crowds workers. Their answers were converted into binary features to build a predictor which predicts whether code would be merged with median F1 scores of 68%. For the same large group of pull requests, the median F1 scores could achieve 90% by a predictor built with additional features defined by prior quantitative results. Based on this case study, we conclude that there is much benefit in combining different kinds of research methods. While qualitative insights are very useful for finding novel insights, they can be hard to scale or replicate. That said, they can guide and define the goals of scalable secondary studies that use (e.g.) crowdsourcing+data mining. On the other hand, while data mining methods are reproducible and scalable to large data sets, their results may be spectacularly wrong since they lack contextual information. That said, they can be used to test the stability and external validity, of the insights gained from a qualitative analysis.