SE Jun 16, 2023 Code
State-Of-The-Practice in Quality Assurance in Java-Based Open Source Software Development
Ali Khatami, Andy Zaidman
To ensure the quality of software systems, software engineers can make use of a variety of quality assurance approaches, such as software testing, modern code review, automated static analysis, and build automation. Each of these quality assurance practices has been studied in depth in isolation, but there is a clear knowledge gap in our understanding of whether and how these approaches are used in conjunction. In our study, we broadly investigate whether and how these quality assurance approaches are being used in conjunction in the development of 1,454 popular open source software projects on GitHub. Our study indicates that projects typically do not follow all quality assurance practices together with high intensity. In fact, we only observe weak correlation among some quality assurance practices. In general, our study provides a deeper understanding of how existing quality assurance approaches are currently being used in Java-based open source software development. We also zoom in on the more mature projects in our dataset and observe that, in general, they apply the quality assurance practices more intensely, with more focus on automated static analysis tool (ASAT) usage and code reviewing, but with no strong change in their continuous integration (CI) usage.
16.1 SE Mar 23
On the Emergence of Testing Strategies: A Socio-technical Grounded Theory
Mark Swillus, Rashina Hoda, Andy Zaidman
Software testing is crucial for ensuring software quality, yet developers' engagement with it varies widely. Identifying the technical, organizational and social factors that lead to differences in engagement is required to remove barriers and utilize enablers for testing. While much research emphasizes the usefulness of software testing approaches and technical solutions, less is known about why developers do (not) test. This study investigates the first-hand experience of developers with software testing. The study illuminates how developers' opinions about testing and their testing behavior change. Through analysis of personal evolutions of practice, we explore when and why testing is used. Employing socio-technical grounded theory (STGT), we construct a theory by systematically analyzing data from 19 in-depth, semi-structured interviews with software developers. Allowing interviewees to reflect on how and why they approach software testing, we explore perspectives that are rooted in their contextual experiences. We develop eleven categories of circumstances that act as conditions for the application and adaptation of testing practices and introduce three concepts that we then use to present a theory of emerging testing strategies (ETS) that explains why developers do (not) use testing practices. This study reveals a new perspective on the connection between testing artifacts and collective reflection of practitioners. It has direct implications for practice and contributes to the groundwork of socio-technical research which embraces testing as an experience in which human and social aspects are entangled with organizational and technical circumstances.
SE Aug 21, 2024
Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests
Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi et al.
Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants from both academia and industry, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.
6.7 SE Apr 19
Beyond the YAML File: Understanding Real-World GitHub Actions Workflow Adoption
Ali Khatami, Carolin Brandt, Andy Zaidman
Continuous Integration and Continuous Deployment (CI/CD) have become fundamental to modern software development, with GitHub Actions (GHA) emerging as a dominant automation platform. In this study, we analyze real-world execution records of GHA, examining how developers react to workflow failures, how these workflows are utilized by projects, and how these aspects relate to project characteristics. We quantitatively analyze 258,300 workflow run records from 952 repositories and perform an in-depth qualitative analysis of 21 selected, diverse GitHub repositories to understand how maintainers and contributors interact with workflow results. We identify three distinct failure response patterns, observe that higher usage intensity of GHA workflows correlates with lower failure rates, and uncover a configuration-usage gap where the presence of configuration files masks disabled or unused workflows. Moreover, our qualitative analysis of relationships between project characteristics and utilization patterns yields five hypotheses for future validation.
SE Mar 1, 2021 Code
How Developers Engineer Test Cases: An Observational Study
Maurício Aniche, Christoph Treude, Andy Zaidman
One of the main challenges that developers face when testing their systems lies in engineering test cases that are good enough to reveal bugs. And while our body of knowledge on software testing and automated test case generation is already quite significant, in practice, developers are still the ones responsible for engineering test cases manually. Therefore, understanding the developers' thought and decision-making processes while engineering test cases is a fundamental step in making developers better at testing software. In this paper, we observe 13 developers thinking aloud while testing different real-world open-source methods, and use these observations to explain how developers engineer test cases. We then challenge and augment our main findings by surveying 72 software developers on their testing practices. We discuss our results from three different angles. First, we propose a general framework that explains how developers reason about testing. Second, we propose and describe in detail the three different overarching strategies that developers apply when testing. Third, we compare and relate our observations with the existing body of knowledge and propose future studies that would advance our knowledge on the topic.
SE Jan 13, 2020 Code
Generating Class-Level Integration Tests Using Call Site Information
Pouria Derakhshanfar, Xavier Devroey, Annibale Panichella et al.
Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated unit tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose CLING, an integration-level test case generation approach that exploits how a pair of classes, the caller and the callee, interact with each other through method calls. In particular, CLING generates integration-level test cases that maximize the Coupled Branches Criterion (CBC). Coupled branches are pairs of branches containing a branch of the caller and a branch of the callee such that an integration test that exercises the former also exercises the latter. CBC is a novel integration-level coverage criterion, measuring the degree to which a test suite exercises the interactions between a caller and its callee classes. We implemented CLING and evaluated the approach on 140 pairs of classes from five different open-source Java projects. Our results show that (1) CLING generates test suites with high CBC coverage, thanks to the definition of the test suite generation as a many-objective problem where each pair of coupled branches is an independent objective; (2) such generated suites trigger different class interactions and can kill on average 7.7% (with a maximum of 50%) of mutants that are not detected by tests generated at the unit level; (3) CLING can detect integration faults coming from wrong assumptions about the usage of the callee class (32 for our subject systems) that remain undetected when using automatically generated unit-level test suites.
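As an illustration of the coverage criterion described in this abstract, the following is a minimal sketch of how a CBC-style coverage ratio could be computed. It assumes the coupled branch pairs are already known and that each test comes with the set of branch identifiers it executed. The names `cbc_coverage`, `coupled_pairs`, and `test_traces` are hypothetical and are not taken from CLING.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: (caller_branch_id, callee_branch_id)
CoupledPair = Tuple[str, str]


def cbc_coverage(coupled_pairs: Set[CoupledPair],
                 test_traces: Iterable[Set[str]]) -> float:
    """Return the fraction of coupled pairs for which at least one test
    exercises both the caller branch and the callee branch."""
    if not coupled_pairs:
        return 1.0  # assumption: nothing to cover counts as fully covered
    traces = list(test_traces)
    covered = {
        pair for pair in coupled_pairs
        if any(pair[0] in trace and pair[1] in trace for trace in traces)
    }
    return len(covered) / len(coupled_pairs)


# Two coupled pairs; only the first is exercised end to end by a single test.
pairs = {("caller:b1", "callee:b3"), ("caller:b2", "callee:b4")}
traces = [{"caller:b1", "callee:b3"}, {"caller:b2"}]
print(cbc_coverage(pairs, traces))  # 0.5
```

In a many-objective formulation like the one the abstract mentions, each pair in `coupled_pairs` would be a separate search objective rather than one aggregated ratio.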
SE Dec 10, 2019 Code
Search-based Crash Reproduction using Behavioral Model Seeding
Pouria Derakhshanfar, Xavier Devroey, Gilles Perrouin et al.
Search-based crash reproduction approaches assist developers during debugging by generating a test case which reproduces a crash given its stack trace. One of the fundamental steps of this approach is creating the objects needed to trigger the crash. One way to make this step easier is seeding: using information about the application during the search process. With seeding, the existing usages of classes can be used in the search process to produce realistic sequences of method calls which create the required objects. In this study, we introduce behavioral model seeding: a new seeding method which learns class usages from both the system under test and existing test cases. Learned usages are then synthesized in a behavioral model (state machine), which serves to guide the evolutionary process. To assess behavioral model seeding, we evaluate it against test seeding (the state-of-the-art technique for seeding realistic objects) and no seeding (without seeding any class usage). For this evaluation, we use a benchmark of 124 hard-to-reproduce crashes stemming from six open-source projects. Our results indicate that behavioral model seeding outperforms both test seeding and no seeding by a minimum of 6% without any notable negative impact on efficiency.
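The idea of synthesizing observed class usages into a state machine can be sketched as follows. This is a simplified illustration only: it counts transitions between consecutive method calls and greedily follows the most frequent ones. It is not the model inference or test selection procedure used in the paper, and the names `learn_usage_model` and `most_likely_sequence` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

START, END = "<start>", "<end>"


def learn_usage_model(call_sequences: List[List[str]]) -> Dict[str, Dict[str, int]]:
    """Count how often each method call follows another across observed usages."""
    model: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sequence in call_sequences:
        previous = START
        for call in sequence:
            model[previous][call] += 1
            previous = call
        model[previous][END] += 1
    return {state: dict(transitions) for state, transitions in model.items()}


def most_likely_sequence(model: Dict[str, Dict[str, int]], max_len: int = 10) -> List[str]:
    """Greedily follow the most frequent transition from the start state."""
    state, result = START, []
    while len(result) < max_len:
        transitions = model.get(state, {END: 1})
        next_call = max(transitions.items(), key=lambda item: item[1])[0]
        if next_call == END:
            break
        result.append(next_call)
        state = next_call
    return result


# Hypothetical usages of a list-like class, e.g. mined from code and tests.
observed = [["new", "add", "size"], ["new", "add", "add", "size"], ["new", "size"]]
model = learn_usage_model(observed)
print(most_likely_sequence(model))  # ['new', 'add', 'size']
```

A seeding strategy could use sequences derived from such a model to initialize or mutate candidate tests, so that the search starts from realistic object states instead of random call sequences.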
SE Jun 18, 2014 Code
A Quality Framework for Agile Requirements: A Practitioner's Perspective
Petra Heck, Andy Zaidman
Verification activities are necessary to ensure that the requirements are specified in a correct way. However, until now requirements verification research has focused on traditional up-front requirements. Agile or just-in-time requirements are by definition incomplete, not specific and might be ambiguous when initially specified, indicating a different notion of 'correctness'. We analyze how verification of agile requirements quality should be performed, based on literature of traditional and agile requirements. This leads to an agile quality framework, instantiated for the specific requirement types of feature requests in open source projects and user stories in agile projects. We have performed an initial qualitative validation of our framework for feature requests with eight practitioners from the Dutch agile community, receiving overall positive feedback.
SE Aug 27, 2021
Developer-Centric Test Amplification: The Interplay Between Automatic Generation and Human Exploration
Carolin Brandt, Andy Zaidman
Automatically generating test cases for software has been an active research topic for many years. While current tools can generate powerful regression or crash-reproducing test cases, these are often kept separate from the maintained test suite. In this paper, we leverage the developer's familiarity with test cases amplified from existing, manually written developer tests. Starting from issues reported by developers in previous studies, we investigate which aspects are important for designing a developer-centric test amplification approach that provides test cases that developers take over into their test suite. We conduct 16 semi-structured interviews with software developers supported by our prototypical designs of a developer-centric test amplification approach and a corresponding test exploration tool. We extend the test amplification tool DSpot, generating test cases that are easier to understand. Our IntelliJ plugin TestCube empowers developers to explore amplified test cases from their familiar environment. From our interviews, we gather 52 observations that we summarize into 23 result categories and give two key recommendations on how future tool designers can make their tools better suited for developer-centric test amplification.
SE Jul 13, 2021
Promises and Perils of Inferring Personality on GitHub
Frenk van Mil, Ayushi Rastogi, Andy Zaidman
Personality plays a pivotal role in our understanding of human actions and behavior. Today, the applications of personality are widespread, built on the solutions from psychology to infer personality. In software engineering, for instance, one widely used solution to infer personality uses textual communication data. As studies on personality in software engineering continue to grow, it is imperative to understand the performance of these solutions. This paper compares the inferential ability of three widely studied text-based personality tests against each other and the ground truth on GitHub. We explore the challenges and potential solutions to improve the inferential ability of personality tests. Our study shows that solutions for inferring personality are far from being perfect. Software engineering communications data can infer individual developer personality with an average error rate of 41%. In the best case, the error rate can be reduced up to 36% by following our recommendations.
SE Aug 6, 2019
Do as I Do, Not as I Say: Do Contribution Guidelines Match the GitHub Contribution Process?
Omar Elazhary, Margaret-Anne Storey, Neil Ernst et al.
Developer contribution guidelines are used in social coding sites like GitHub to explain and shape the process a project expects contributors to follow. They set standards for all participants and "save time and hassle caused by improperly created pull requests or issues that have to be rejected and resubmitted" (GitHub). Yet, we lack a systematic understanding of the content of a typical contribution guideline, as well as the extent to which these guidelines are followed in practice. Additionally, understanding how guidelines may impact projects that use Continuous Integration as part of the contribution process is of particular interest. To address this knowledge gap, we conducted a mixed-methods study of 53 GitHub projects with explicit contribution guidelines and coded the guidelines to extract key themes. We then created a process model using GitHub activity data (e.g., commit, new issue, new pull request) to compare the actual activity with the prescribed contribution guidelines. We show that approximately 68% of these projects diverge significantly from the expected process.
SE Jul 25, 2019
Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs
Gemma Catolino, Fabio Palomba, Andy Zaidman et al.
Modern version control systems such as Git or SVN include bug tracking mechanisms, through which developers can highlight the presence of bugs through bug reports, i.e., textual descriptions reporting the problem and the steps that led to a failure. In past and recent years, the research community has deeply investigated methods for easing bug triage, that is, the process of assigning the fixing of a reported bug to the most qualified developer. Nevertheless, only a few studies have reported on how to support developers in the process of understanding the type of a reported bug, which is the first and most time-consuming step to perform before assigning a bug-fix operation. In this paper, we target this problem in two ways: first, we analyze 1,280 bug reports of 119 popular projects belonging to three ecosystems, namely Mozilla, Apache, and Eclipse, with the aim of building a taxonomy of the root causes of reported bugs; then, we devise and evaluate an automated classification model able to classify reported bugs according to the defined taxonomy. As a result, we found nine main common root causes of bugs over the considered systems. Moreover, our model achieves high F-Measure and AUC-ROC (64% and 74% overall, respectively).
SE May 26, 2019
Improving Change Prediction Models with Code Smell-Related Information
Gemma Catolino, Fabio Palomba, Francesca Arcelli Fontana et al.
Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis of these findings, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models whose goal is to indicate to developers which classes are more likely to change in the future, so that they may apply preventive maintenance actions. Specifically, we exploit the so-called intensity index, a previously defined metric that captures the severity of a code smell, and evaluate its contribution when added as an additional feature in the context of three state-of-the-art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with that of an alternative technique that considers the previously defined antipattern metrics, namely a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than that of the baselines and (ii) the intensity is a more powerful metric than the alternative smell-related ones.
SE Dec 20, 2018
Automatic Quality Assurance and Release (Report from Dagstuhl Seminar 18122)
Bram Adams, Benoit Baudry, Sigrid Eldh et al.
This report documents the program and the outcomes of Dagstuhl Seminar 18122 "Automatic Quality Assurance and Release". The main goal of this seminar was to bridge the knowledge divide on how researchers and industry professionals reason about and implement DevOps for automatic quality assurance. Through the seminar, we have built up a common understanding of DevOps tools and practices, but we have also identified major academic and educational challenges for this field of research.
SE May 30, 2017
A Snowballing Literature Study on Test Amplification
Benjamin Danglot, Oscar Luis Vera-Pérez, Zhongxing Yu et al.
The adoption of agile development approaches has put an increased emphasis on developer testing, resulting in software projects with strong test suites. These suites include a large number of test cases, in which developers embed knowledge about meaningful input data and expected properties in the form of oracles. This article surveys various works that aim at exploiting this knowledge in order to enhance these manually written tests with respect to an engineering goal (e.g., improve coverage of changes or increase the accuracy of fault localization). While these works rely on various techniques and address various goals, we believe they form an emerging and coherent field of research, which we call 'test amplification'. We devised a first set of papers from DBLP, looking for all papers containing 'test' and 'amplification' in their title. We reviewed the 70 papers in this set and selected the 4 papers that fit our definition of test amplification. We use these 4 papers as the seed for our snowballing study, and systematically followed the citation graph. This study is the first that draws a comprehensive picture of the different engineering goals proposed in the literature for test amplification. In particular, we note that the goal of test amplification goes far beyond maximizing coverage only. We believe that this survey will help researchers and practitioners entering this new field to understand more quickly and more deeply the intuitions, concepts and techniques used for test amplification.
SE Jul 16, 2014
Web API Fragility: How Robust is Your Web API Client
Tiago Espinha, Andy Zaidman, Hans-Gerhard Gross
Web APIs provide a systematic and extensible approach for application-to-application interaction. A large number of mobile applications make use of web APIs to integrate services into apps. Each web API's evolution pace is determined by its respective developer, and mobile application developers are forced to accompany the API providers in their software evolution tasks. In this paper we investigate whether mobile application developers understand, and how they deal with, the added distress of evolving web APIs. In particular, we studied how robust 48 high-profile mobile applications are when dealing with mutated web API responses. Additionally, we interviewed three mobile application developers to better understand their choices and trade-offs regarding web API integration.
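To make the notion of testing a client against mutated web API responses more concrete, here is a minimal sketch. It assumes JSON-like responses represented as dictionaries and two illustrative mutation operators (dropping a field and setting it to null). The operators, the function names, and `naive_client` are hypothetical and do not reproduce the study's actual setup.

```python
import copy
from typing import Any, Callable, Dict, Iterator, List, Tuple

Response = Dict[str, Any]


def mutate_response(response: Response) -> Iterator[Tuple[str, Response]]:
    """Yield labeled variants of a response: each field dropped, then nulled."""
    for key in response:
        dropped = copy.deepcopy(response)
        del dropped[key]
        yield f"drop:{key}", dropped

        nulled = copy.deepcopy(response)
        nulled[key] = None
        yield f"null:{key}", nulled


def fragile_mutations(client: Callable[[Response], Any],
                      response: Response) -> List[str]:
    """Return the labels of mutations that make the client raise an exception."""
    failures = []
    for label, mutated in mutate_response(response):
        try:
            client(mutated)
        except Exception:
            failures.append(label)
    return failures


def naive_client(payload: Response) -> str:
    # Assumes "user" is always a string; tolerates a missing "age".
    return payload["user"].upper() + ":" + str(payload.get("age", 0))


print(fragile_mutations(naive_client, {"user": "ada", "age": 36}))
# ['drop:user', 'null:user']
```

Here the client survives changes to `age` because it reads that field defensively, but crashes on any change to `user`.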