5.7SEApr 22
Quo Vadis, Code Review? Exploring the Future of Code ReviewMichael Dorner, Andreas Bauer, Darja Šmite et al.
Context: Code review has long been a core practice in collaborative software engineering. As automation becomes increasingly embedded in development workflows, the role and functioning of code review are subject to change. Objective: This study explores how professional developers anticipate the evolution of code review and identifies emerging tensions reflected in these expectations. Method: We conducted a cross-sectional survey with 100 developers across five software-driven companies. The survey captured estimates of current review time and reviewed artifacts, as well as anticipated changes over a five-year horizon. Open-ended questions invited reflections on the future of code review. Quantitative responses were analyzed descriptively, and open-ended responses were independently coded by multiple researchers using thematic analysis to identify recurring patterns in participant responses. Results: Practitioners expect code review to remain essential, anticipating stable or increased time investment and a broader range of reviewed artifacts over the next five years. In open-ended responses, many participants explicitly referenced AI and large language models (LLMs), describing increasing automation in both code authoring and reviewing, including scenarios in which automated systems operate in both roles. Conclusion: Our analysis suggests emerging tensions concerning understanding, accountability, and trust in automation-mediated code review. These tensions provide early empirical signals of socio-technical challenges and position code review as a concrete setting for examining the implications of LLM integration in collaborative software engineering.
SEMar 6
Real-World Fault Detection for C-Extended Python Projects with Automated Unit Test GenerationLucas Berg, Lukas Krodinger, Stephan Lukasczyk et al.
Many popular Python libraries use C-extensions for performance-critical operations allowing users to combine the best of the two worlds: The simplicity and versatility of Python and the performance of C. A drawback of this approach is that exceptions raised in C can bypass Python's exception handling and cause the entire interpreter to crash. These crashes are real faults if they occur when calling a public API. While automated test generation should, in principle, detect such faults, crashes in native code can halt the test process entirely, preventing detection or reproduction of the underlying errors and inhibiting coverage of non-crashing parts of the code. To overcome this problem, we propose separating the generation and execution stages of the test-generation process. We therefore adapt Pynguin, an automated test case generation tool for Python, to use subprocess-execution. Executing each generated test in an isolated subprocess prevents a crash from halting the test generation process itself. This allows us to (1) detect such faults, (2) generate reproducible crash-revealing test cases for them, (3) allow studying the underlying faults, and (4) enable test generation for non-crashing parts of the code. To evaluate our approach, we created a dataset consisting of 1648 modules from 21 popular Python libraries with C-extensions. Subprocess-execution allowed automated testing of up to 56.5% more modules and discovered 213 unique crash causes, revealing 32 previously unknown faults.
SEJan 22, 2021Code
An Empirical Study of Flaky Tests in PythonMartin Gruber, Stephan Lukasczyk, Florian Kroiß et al.
Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22352 open source projects from the popular PyPI package index, and analyzed their 876186 test cases for flakiness. Our investigation suggests that flakiness is equally prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: A 95% confidence that a passing test case is not flaky on average would require 170 reruns.
SEFeb 10, 2022
Pynguin: Automated Unit Test Generation for PythonStephan Lukasczyk, Gordon Fraser
Automated unit test generation is a well-known methodology aiming to reduce the developers' effort of writing tests manually. Prior research focused mainly on statically typed programming languages like Java. In practice, however, dynamically typed languages have received a huge gain in popularity over the last decade. This introduces the need for tools and research on test generation for these languages, too. We introduce Pynguin, an extendable test-generation framework for Python, which generates regression tests with high code coverage. Pynguin is designed to be easily usable by practitioners; it is also extensible to allow researchers to adapt it for their needs and to enable future research. We provide a demo of Pynguin at https://youtu.be/UiGrG25Vts0; further information, documentation, the tool, and its source code are available at https://www.pynguin.eu.
SENov 9, 2021
An Empirical Study of Automated Unit Test Generation for PythonStephan Lukasczyk, Florian Kroiß, Gordon Fraser
Various mature automated test generation tools exist for statically typed programming languages such as Java. Automatically generating unit tests for dynamically typed programming languages such as Python, however, is substantially more difficult due to the dynamic nature of these languages as well as the lack of type information. Our Pynguin framework provides automated unit test generation for Python. In this paper, we extend our previous work on Pynguin to support more aspects of the Python language, and by studying a larger variety of well-established state of the art test-generation algorithms, namely DynaMOSA, MIO, and MOSA. Furthermore, we improved our Pynguin tool to generate regression assertions, whose quality we also evaluate. Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and similar to the Java world, DynaMOSA yields the highest coverage results. However, our results also demonstrate that there are still fundamental remaining issues, such as inferring type information for code without this information, currently limiting the effectiveness of test generation for Python.
SEAug 16, 2021
Improving Readability of Scratch Programs with Search-based RefactoringFelix Adler, Gordon Fraser, Eva Gründinger et al.
Block-based programming languages like Scratch have become increasingly popular as introductory languages for novices. These languages are intended to be used with a "tinkering" approach which allows learners and teachers to quickly assemble working programs and games, but this often leads to low code quality. Such code can be hard to comprehend, changing it is error-prone, and learners may struggle and lose interest. The general solution to improve code quality is to refactor the code. However, Scratch lacks many of the common abstraction mechanisms used when refactoring programs written in higher programming languages. In order to improve Scratch code, we therefore propose a set of atomic code transformations to optimise readability by (1) rewriting control structures and (2) simplifying scripts using the inherently concurrent nature of Scratch programs. By automating these transformations it is possible to explore the space of possible variations of Scratch programs. In this paper, we describe a multi-objective search-based approach that determines sequences of code transformations which improve the readability of a given Scratch program and therefore form refactorings. Evaluation on a random sample of 1000 Scratch programs demonstrates that the generated refactorings reduce complexity and entropy in 70.4% of the cases, and 354 projects are improved in at least one metric without making any other metric worse. The refactored programs can help both novices and their teachers to improve their code.
SEJul 28, 2020
Automated Unit Test Generation for PythonStephan Lukasczyk, Florian Kroiß, Gordon Fraser
Automated unit test generation is an established research field, and mature test generation tools exist for statically typed programming languages such as Java. It is, however, substantially more difficult to automatically generate supportive tests for dynamically typed programming languages such as Python, due to the lack of type information and the dynamic nature of the language. In this paper, we describe a foray into the problem of unit test generation for dynamically typed languages. We introduce Pynguin, an automated unit test generation framework for Python. Using Pynguin, we aim to empirically shed light on two central questions: (1) Do well-established search-based test generation methods, previously evaluated only on statically typed languages, generalise to dynamically typed languages? (2) What is the influence of incomplete type information and dynamic typing on the problem of automated test generation? Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and can even alleviate the problem of absent type information to some degree. However, our results demonstrate that dynamic typing nevertheless poses a fundamental issue for test generation, suggesting future work on integrating type inference.