SE, Mar 7, 2023
From Copilot to Pilot: Towards AI Supported Software Development
Rohith Pudari, Neil A. Ernst
AI-supported programming has arrived, as shown by the introduction and successes of large language models for code, such as Copilot/Codex (Github/OpenAI) and AlphaCode (DeepMind). Above human average performance on programming challenges is now possible. However, software engineering is much more than solving programming contests. Moving beyond code completion to AI-supported software engineering will require an AI system that can, among other things, understand how to avoid code smells, to follow language idioms, and eventually (maybe!) propose rational software designs. In this study, we explore the current limitations of AI-supported code completion tools like Copilot and offer a simple taxonomy for understanding the classification of AI-supported code completion tools in this space. We first perform an exploratory study on Copilot's code suggestions for language idioms and code smells. Copilot neither follows language idioms nor avoids code smells in most of our test scenarios. We then conduct additional investigation to determine the current boundaries of AI-supported code completion tools like Copilot by introducing a taxonomy of software abstraction hierarchies where 'basic programming functionality' such as code compilation and syntax checking is at the least abstract level, and software architecture analysis and design are at the most abstract level. We conclude by providing a discussion on challenges for future development of AI-supported code completion tools to reach the design level of abstraction in our taxonomy.
SE, Mar 20
The Nature of Technical Debt in Research Software
Neil A. Ernst, Ahmed Musa Awon, Swapnil Hingmire et al.
Research software (also called scientific software) is essential for advancing scientific endeavours. Research software encapsulates complex algorithms and domain-specific knowledge and is a fundamental component of all science. A pervasive challenge in developing research software is technical debt, which can adversely affect reliability, maintainability, and scientific validity. Research software often relies on the initiative of the scientific community for maintenance, requiring diverse expertise in both scientific and software engineering domains. The extent and nature of technical debt in research software are little studied, in particular, what forms it takes, and what the science teams developing this software think about their technical debt. In this paper we describe our multi-method study examining technical debt in research software. We begin by examining instances of self-reported technical debt in research code, examining 28k code comments across nine research software projects. Then, building on our findings, we interview research software engineers and scientists about how this technical debt manifests itself in their experience, and what costs it has for research software and research outputs more generally. We identify nine types of self-admitted technical debt unique to research software, and four themes impacting this technical debt.
SE, May 31, 2017
What to Fix? Distinguishing between design and non-design rules in automated tools
Neil A. Ernst, Stephany Bellomo, Ipek Ozkaya et al.
Technical debt---design shortcuts taken to optimize for delivery speed---is a critical part of long-term software costs. Consequently, automatically detecting technical debt is a high priority for software practitioners. Software quality tool vendors have responded to this need by positioning their tools to detect and manage technical debt. While these tools bundle a number of rules, it is hard for users to understand which rules identify design issues, as opposed to syntactic quality. This is important, since previous studies have revealed the most significant technical debt is related to design issues. Other research has focused on comparing these tools on open source projects, but these comparisons have not looked at whether the rules were relevant to design. We conducted an empirical study using a structured categorization approach, and manually classified 466 software quality rules from three industry tools---CAST, SonarQube, and NDepend. We found that most of these rules were easily labeled as either not design (55%) or design (19%). The remainder (26%) resulted in disagreements among the labelers. Our results are a first step in formalizing a definition of a design rule, in order to support automatic detection.
SE, Jun 17, 2021
Conclusion Stability for Natural Language Based Mining of Design Discussions
Alvi Mahadi, Neil A. Ernst, Karan Tongay
Developer discussions range from in-person hallway chats to comment chains on bug reports. Being able to identify discussions that touch on software design would be helpful in documentation and refactoring software. Design mining is the application of machine learning techniques to correctly label a given discussion artifact, such as a pull request, as pertaining (or not) to design. In this paper we demonstrate a simple example of how design mining works. We then show how conclusion stability is poor on different artifact types and different projects. We show two techniques -- augmentation and context specificity -- that greatly improve the conclusion stability and cross-project relevance of design mining. Our new approach achieves AUC of 0.88 on within dataset classification and 0.80 on the cross-dataset classification task.
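As a reading aid for the AUC figures quoted above, here is a minimal sketch of how AUC can be computed for a binary design / not-design classifier. It uses the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive item gets the higher score). The function name `auc` and the toy labels and scores are assumptions for illustration; this is not the authors' pipeline or data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for "design", 0 for "not design" (hypothetical encoding).
    scores: classifier scores, higher meaning "more likely design".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    # Count pairs where the positive outranks the negative; ties count half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: four discussions, two labeled as design.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a drop between within-dataset and cross-dataset AUC is a compact way to express conclusion stability.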
SE, Feb 13, 2021
ADEPT: A Socio-Technical Theory of Continuous Integration
Omar Elazhary, Margaret-Anne Storey, Neil A. Ernst et al.
Continuous practices that rely on automation in the software development workflow have been widely adopted by industry for over a decade. Despite this widespread use, software development remains a primarily human-driven activity that is highly creative and collaborative. There has been extensive research on how continuous practices rely on automation and its impact on software quality and development velocity, but relatively little has been done to understand how automation impacts developer behavior and collaboration. In this paper, we introduce a socio-technical theory about continuous practices. The ADEPT theory combines constructs that include humans, processes, documentation, automation and the project environment, and describes propositions that relate these constructs. The theory was derived from phenomena observed in previous empirical studies. We show how the ADEPT theory can explain and describe existing continuous practices in software development, and how it can be used to generate new propositions for future studies to understand continuous practices and their impact on the social and technical aspects of software development.
SE, Sep 2, 2020
Understanding Peer Review of Software Engineering Papers
Neil A. Ernst, Jeffrey C. Carver, Daniel Mendez et al.
Peer review is a key activity intended to preserve the quality and integrity of scientific publications. However, in practice it is far from perfect. We aim at understanding how reviewers, including those who have won awards for reviewing, perform their reviews of software engineering papers to identify both what makes a good reviewing approach and what makes a good paper. We first conducted a series of in-person interviews with well-respected reviewers in the software engineering field. Then, we used the results of those interviews to develop a questionnaire used in an online survey and sent out to reviewers from well-respected venues covering a number of software engineering disciplines, some of whom had won awards for their reviewing efforts. We analyzed the responses from the interviews and from 175 reviewers who completed the online survey (including both reviewers who had won awards and those who had not). We report on several descriptive results, including: 45% of award-winners are reviewing 20+ conference papers a year, while 28% of non-award winners conduct that many. 88% of reviewers are taking more than two hours on journal reviews. We also report on qualitative results. For writing a good review, the important criteria were that it should be factual and helpful; these ranked above others such as being detailed or kind. The most important features of papers that result in positive reviews are clear and supported validation, an interesting problem, and novelty. Conversely, negative reviews tend to result from papers that have a mismatch between the method and the claims and from those with overly grandiose claims. The main recommendation for authors is to make the contribution of the work very clear in their paper. In addition, reviewers viewed data availability and its consistency as being important.
SE, Jan 6, 2020
Cross-Dataset Design Discussion Mining
Alvi Mahadi, Karan Tongay, Neil A. Ernst
Being able to identify software discussions that are primarily about design, which we call design mining, can improve documentation and maintenance of software systems. Existing design mining approaches have good classification performance using natural language processing (NLP) techniques, but the conclusion stability of these approaches is generally poor. A classifier trained on a given dataset of software projects has so far not worked well on different artifacts or different datasets. In this study, we replicate and synthesize these earlier results in a meta-analysis. We then apply recent work in transfer learning for NLP to the problem of design mining. However, for our datasets, these deep transfer learning classifiers perform no better than less complex classifiers. We conclude by discussing some reasons behind the transfer learning approach to design mining.
SE, May 30, 2019
The Who, What, How of Software Engineering Research: A Socio-Technical Framework
Margaret-Anne Storey, Neil A. Ernst, Courtney Williams et al.
Software engineering is a socio-technical endeavor, and while many of our contributions focus on technical aspects, human stakeholders such as software developers are directly affected by and can benefit from our research and tool innovations. In this paper, we question how much of our research addresses human and social issues, and explore how much we study human and social aspects in our research designs. To answer these questions, we developed a socio-technical research framework to capture the main beneficiary of a research study (the who), the main type of research contribution produced (the what), and the research strategies used in the study (how we methodologically approach delivering relevant results given the who and what of our studies). We used this Who-What-How framework to analyze 151 papers from two well-cited publishing venues---the main technical track at the International Conference on Software Engineering, and the Empirical Software Engineering Journal by Springer---to assess how much this published research explicitly considers human aspects. We find that although a majority of these papers claim the contained research should benefit human stakeholders, most focus on technical contributions without engaging humans in their studies. Although our analysis is scoped to two venues, our results suggest a need for more diversification and triangulation of research strategies. In particular, there is a need for strategies that aim at a deeper understanding of human and social aspects of software development practice to balance the design and evaluation of technical innovations. We recommend that the framework should be used in the design of future studies in order to nudge software engineering research towards explicitly including human and social concerns in their designs, and to improve the relevance of our research for human stakeholders.
SE, Sep 26, 2018
A Method to Assess and Argue for Practical Significance in Software Engineering
Richard Torkar, Carlo A. Furia, Robert Feldt et al.
A key goal of empirical research in software engineering is to assess practical significance, which answers whether the observed effects of some compared treatments show a relevant difference in practice in realistic scenarios. Even though plenty of standard techniques exist to assess statistical significance, connecting it to practical significance is not straightforward or routinely done; indeed, only a few empirical studies in software engineering assess practical significance in a principled and systematic way. In this paper, we argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously. We demonstrate our claims in a case study comparing different test techniques. The case study's data was previously analyzed (Afzal et al., 2015) using standard techniques focusing on statistical significance. Here, we build a multilevel model of the same data, which we fit and validate using Bayesian techniques. Our method is to apply cumulative prospect theory on top of the statistical model to quantitatively connect our statistical analysis output to a practically meaningful context. This is then the basis both for assessing and arguing for practical significance. Our study demonstrates that Bayesian analysis provides a technically rigorous yet practical framework for empirical software engineering. A substantial side effect is that any uncertainty in the underlying data will be propagated through the statistical model, and its effects on practical significance are made clear. Thus, in combination with cumulative prospect theory, Bayesian analysis supports seamlessly assessing practical significance in an empirical software engineering context, thus potentially clarifying and extending the relevance of research for practitioners.
SE, Apr 6, 2018
Bayesian Hierarchical Modelling for Tailoring Metric Thresholds
Neil A. Ernst
Software is highly contextual. While there are cross-cutting 'global' lessons, individual software projects exhibit many 'local' properties. This data heterogeneity makes drawing local conclusions from global data dangerous. A key research challenge is to construct locally accurate prediction models that are informed by global characteristics and data volumes. Previous work has tackled this problem using clustering and transfer learning approaches, which identify locally similar characteristics. This paper applies a simpler approach known as Bayesian hierarchical modeling. We show that hierarchical modeling supports cross-project comparisons, while preserving local context. To demonstrate the approach, we conduct a conceptual replication of an existing study on setting software metrics thresholds. Our emerging results show our hierarchical model reduces model prediction error compared to a global approach by up to 50%.
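To make the idea of local estimates informed by global data concrete, the following is a minimal sketch of partial pooling in a simple normal-normal setting. It is a simplified stand-in and not the paper's model: a full Bayesian hierarchical model would place priors on the variances and fit them (for example with MCMC), whereas here the within-project variance is passed in and the between-project variance is a crude plug-in estimate. The function name `partial_pool`, the project names, and the metric values are hypothetical.

```python
from statistics import mean, pvariance


def partial_pool(project_values, within_var):
    """Shrink each project's mean metric toward the global mean.

    project_values: dict mapping project name -> list of metric values.
    within_var: assumed (known, positive) variance of values inside a project.
    """
    local_means = {name: mean(vals) for name, vals in project_values.items()}
    global_mean = mean(list(local_means.values()))
    # Crude plug-in estimate of how much projects differ from one another.
    between_var = pvariance(list(local_means.values()))

    pooled = {}
    for name, vals in project_values.items():
        sampling_var = within_var / len(vals)  # uncertainty of the local mean
        # Projects with little data (large sampling_var) lean on the global mean.
        weight = between_var / (between_var + sampling_var)
        pooled[name] = weight * local_means[name] + (1 - weight) * global_mean
    return pooled


# Hypothetical metric values for two projects of very different sizes.
metrics = {"project_a": [10, 12, 11, 9], "project_b": [30]}
pooled_estimates = partial_pool(metrics, within_var=4.0)
```

In this sketch the project with a single observation is pulled further toward the global mean than the project with four, which is the behaviour that lets a hierarchical model support cross-project comparison while keeping local context.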
SE, Feb 18, 2017
"SHORT"er Reasoning About Larger Requirements Models
George Mathew, Tim Menzies, Neil A. Ernst et al.
When Requirements Engineering (RE) models are unreasonably complex, they cannot support efficient decision making. SHORT is a tool to simplify that reasoning by exploiting the "key" decisions within RE models. These "keys" have the property that once values are assigned to them, it is very fast to reason over the remaining decisions. Using these "keys", reasoning about RE models can be greatly SHORTened by focusing stakeholder discussion on just these key decisions. This paper evaluates the SHORT tool on eight complex RE models. We find that the number of keys is typically only 12% of all decisions. Since they are so few in number, keys can be used to reason faster about models. For example, using keys, we can optimize over those models (to achieve the most goals at least cost) two to three orders of magnitude faster than standard methods. Better yet, finding those keys is not difficult: SHORT runs in low order polynomial time and terminates in a few minutes for the largest models.