Gordon Fraser

SE
31 papers
778 citations
Novelty 40%
AI Score 51

31 Papers

SE · Mar 6
Real-World Fault Detection for C-Extended Python Projects with Automated Unit Test Generation

Lucas Berg, Lukas Krodinger, Stephan Lukasczyk et al.

Many popular Python libraries use C-extensions for performance-critical operations allowing users to combine the best of the two worlds: The simplicity and versatility of Python and the performance of C. A drawback of this approach is that exceptions raised in C can bypass Python's exception handling and cause the entire interpreter to crash. These crashes are real faults if they occur when calling a public API. While automated test generation should, in principle, detect such faults, crashes in native code can halt the test process entirely, preventing detection or reproduction of the underlying errors and inhibiting coverage of non-crashing parts of the code. To overcome this problem, we propose separating the generation and execution stages of the test-generation process. We therefore adapt Pynguin, an automated test case generation tool for Python, to use subprocess-execution. Executing each generated test in an isolated subprocess prevents a crash from halting the test generation process itself. This allows us to (1) detect such faults, (2) generate reproducible crash-revealing test cases for them, (3) allow studying the underlying faults, and (4) enable test generation for non-crashing parts of the code. To evaluate our approach, we created a dataset consisting of 1648 modules from 21 popular Python libraries with C-extensions. Subprocess-execution allowed automated testing of up to 56.5% more modules and discovered 213 unique crash causes, revealing 32 previously unknown faults.
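The isolation idea described above can be illustrated with a minimal sketch. It is not Pynguin's implementation: the function name `run_isolated` and the outcome labels are hypothetical, and the crash check relies on the POSIX convention that a child killed by a signal reports a negative return code.

```python
import subprocess
import sys


def run_isolated(test_source: str, timeout: float = 10.0) -> str:
    """Run a test in a fresh interpreter and classify the outcome.

    Hypothetical helper: the labels "passed", "failed", "crashed" and
    "timeout" are illustrative, not taken from any tool.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", test_source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "passed"
    # On POSIX, a negative return code means the child was terminated by a
    # signal (for example SIGSEGV), which is how a native crash shows up.
    if proc.returncode < 0:
        return "crashed"
    # An uncaught Python exception ends the child with a positive exit code.
    return "failed"
```

Because each test runs in its own process, a segmentation fault only ends that child; the calling process can record the crashing input and move on to the next test.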

75.5 · SE · Apr 23 · Code
Generalizing Test Cases for Comprehensive Test Scenario Coverage

Binhang Qi, Yun Lin, Xinyi Weng et al.

Test cases are essential for software development and maintenance. In practice, developers derive multiple test cases from an implicit pattern based on their understanding of requirements and inference of diverse test scenarios, each validating a specific behavior of the focal method. However, producing comprehensive tests is time-consuming and error-prone: many important tests that should have accompanied the initial test are added only after a significant delay, sometimes only after bugs are triggered. Existing automated test generation techniques largely focus on code coverage. Yet in real projects, practical tests are seldom driven by code coverage alone, since test scenarios do not necessarily align with control-flow branches. Instead, test scenarios originate from requirements, which are often undocumented and implicitly embedded in a project's design and implementation. However, developer-written tests are frequently treated as executable specifications; thus, even a single initial test that reflects the developer's intent can reveal the underlying requirement and the diverse scenarios that should be validated. In this work, we propose TestGeneralizer, a framework for generalizing test cases to comprehensively cover test scenarios. TestGeneralizer orchestrates three stages: (1) enhancing the understanding of the requirement and scenario behind the focal method and initial test; (2) generating a test scenario template and crystallizing it into various test scenario instances; and (3) generating and refining executable test cases from these instances. We evaluate TestGeneralizer against three state-of-the-art baselines on 12 open-source Java projects. TestGeneralizer achieves significant improvements: +31.66% and +23.08% over ChatTester, in mutation-based and LLM-assessed scenario coverage, respectively.

SE · Jul 6, 2024
Combining Neuroevolution with the Search for Novelty to Improve the Generation of Test Inputs for Games

Patric Feldmeier, Gordon Fraser

As games challenge traditional automated white-box test generators, the Neatest approach generates test suites consisting of neural networks that exercise the source code by playing the games. Neatest generates these neural networks using an evolutionary algorithm that is guided by an objective function targeting individual source code statements. This approach works well if the objective function provides sufficient guidance, but deceiving or complex fitness landscapes may inhibit the search. In this paper, we investigate whether the issue of challenging fitness landscapes can be addressed by promoting novel behaviours during the search. Our case study on two Scratch games demonstrates that rewarding novel behaviours is a promising approach for overcoming challenging fitness landscapes, thus enabling future research on how to adapt the search algorithms to best use this information.
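As a rough illustration of what rewarding novel behaviour can mean, the sketch below uses the classic novelty-search score: the mean distance from a behaviour to its k nearest neighbours in an archive of earlier behaviours. The abstract does not say how the paper measures novelty, so the behaviour vectors, the Euclidean distance, and the value of `k` are all assumptions.

```python
import math


def novelty(behaviour, archive, k=3):
    """Mean Euclidean distance to the k nearest behaviours in the archive.

    `behaviour` and the archive entries are equal-length numeric tuples,
    e.g. hypothetical final sprite coordinates reached while playing.
    """
    if not archive:
        # Nothing to compare against: treat the behaviour as maximally novel.
        return float("inf")
    distances = sorted(math.dist(behaviour, other) for other in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)
```

A search could add this score to, or alternate it with, the coverage-based objective so that individuals reaching unexplored program states are kept even when the fitness function gives them no credit.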

16.9 · SE · Mar 30
Voice-Controlled Scratch for Children with (Motor) Disabilities

Elias Goller, Gordon Fraser, Isabella Graßl

Block-based programming environments like Scratch have become widely adopted in Computer Science Education, but the mouse-based drag-and-drop interface can challenge users with disabilities. While prior work has provided solutions supporting children with visual impairment, these solutions tend to focus on making content perceivable and do not address the physical interaction barriers faced by users with motor disabilities. To bridge this gap, we introduce MeowCrophone, an approach that uses voice control to allow editing code in Scratch. MeowCrophone supports clicking elements, placing blocks, and navigating the workspace via a multi-modal voice user interface that uses numerical overlays and label reading to bypass physical input entirely. As imperfect speech recognition is common in classrooms and for children with dysarthria, MeowCrophone employs a multi-stage matching pipeline using regular expressions, phonetic matching, and a custom grammar. Evaluation shows that while free speech recognition systems achieved a baseline success rate of only 46.4%, MeowCrophone's pipeline improved results to 82.8% overall, with simple commands reaching 96.9% accuracy. This demonstrates that robust voice control can make Scratch accessible to users for whom visual aids are insufficient.
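A staged matching pipeline of this general shape could look like the sketch below. It is not MeowCrophone's pipeline: the command list and the `click <number>` pattern are hypothetical, the custom grammar stage is omitted, and plain string similarity from the standard library's `difflib` stands in for real phonetic matching.

```python
import difflib
import re

# Hypothetical command vocabulary, for illustration only.
COMMANDS = ["click", "place", "scroll up", "scroll down", "delete"]

# Stage 1 pattern: "click <number>" selects a numbered overlay.
CLICK_NUMBER = re.compile(r"^click (\d+)$")


def match_command(transcript):
    """Map a recognised utterance to a (command, argument) pair, or None."""
    text = transcript.strip().lower()
    # Stage 1: strict regular-expression match for structured commands.
    found = CLICK_NUMBER.match(text)
    if found:
        return ("click", int(found.group(1)))
    # Stage 2: exact match against the known vocabulary.
    if text in COMMANDS:
        return (text, None)
    # Stage 3: approximate match. String similarity is used here as a
    # stand-in for phonetic matching.
    close = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=0.6)
    if close:
        return (close[0], None)
    return None
```

Ordering the stages from strict to lenient means a clean recognition result is never overridden by a fuzzy guess, while a slightly garbled one can still be recovered.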

8.1 · SE · Mar 27
From Personas to Programming: Gender-specific Effects of Design Thinking-Based Computing Education at Secondary Schools

Isabella Graßl, Gordon Fraser, Daniela Damian

Creative approaches to attract students to software engineering at an early age are emerging, yet their differential impact on gender remains unclear. This study investigates whether design thinking's empathy-driven approach addresses the documented gender gap in interest in software engineering. In a 10-week curriculum-integrated design thinking software development course with 55 secondary school students aged 13-15 from two schools in Canada, we examined gendered differences in perceived gains in knowledge and interest, as well as in social-emotional experiences. Our results show that both girls and boys gained perceived knowledge in software development. However, girls showed significant improvements in self-efficacy, interest, engagement with sustainability topics, and well-being, including optimism, sense of usefulness, and social connectedness. Positive emotions were strongest during creative, collaborative phases, while technical tasks led to some boredom, especially among boys, though they still benefited overall. This suggests that human-centred design thinking might be one effective way to address gender equity challenges, though we need more differentiated technical implementations.

SE · Feb 15, 2021 · Code
LitterBox: A Linter for Scratch Programs

Gordon Fraser, Ute Heuer, Nina Körber et al.

Creating programs with block-based programming languages like Scratch is easy and fun. Block-based programs can nevertheless contain bugs, in particular when learners have misconceptions about programming. Even when they do not, Scratch code is often of low quality and contains code smells, further inhibiting understanding, reuse, and fun. To address this problem, in this paper we introduce LitterBox, a linter for Scratch programs. Given a program or its public project ID, LitterBox checks the program against patterns of known bugs and code smells. For each issue identified, LitterBox provides not only the location in the code, but also a helpful explanation of the underlying reason and possible misconceptions. Learners can access LitterBox through an easy to use web interface with visual information about the errors in the block-code, while for researchers LitterBox provides a general, open source, and extensible framework for static analysis of Scratch programs.
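To give a flavour of pattern-based checking, the sketch below flags broadcast messages that no script ever receives. It is a toy check, not LitterBox code: the program is represented as a flat list of dictionaries with an `opcode` and a `message` key, which is a simplification of Scratch's project format chosen for illustration.

```python
# Scratch 3 opcodes for sending and receiving broadcast messages.
SEND_OPCODES = ("event_broadcast", "event_broadcastandwait")
RECEIVE_OPCODE = "event_whenbroadcastreceived"


def messages_never_received(blocks):
    """Return the sorted names of messages that are sent but never received.

    `blocks` is a simplified, hypothetical program representation: a list
    of dicts, each with an "opcode" and (for events) a "message".
    """
    sent = {b["message"] for b in blocks if b["opcode"] in SEND_OPCODES}
    received = {b["message"] for b in blocks if b["opcode"] == RECEIVE_OPCODE}
    return sorted(sent - received)
```

A linter built this way runs many such checks over the same program representation and attaches an explanation to each finding.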

SE · Jan 22, 2021 · Code
An Empirical Study of Flaky Tests in Python

Martin Gruber, Stephan Lukasczyk, Florian Kroiß et al.

Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22352 open source projects from the popular PyPI package index, and analyzed their 876186 test cases for flakiness. Our investigation suggests that flakiness is equally prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: A 95% confidence that a passing test case is not flaky on average would require 170 reruns.
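The link between how rarely a flaky test fails and how many reruns are needed can be sketched under a simplifying assumption: every run fails independently with the same probability p. A flaky test then passes n runs in a row with probability (1 - p)^n, so n must satisfy (1 - p)^n <= 1 - confidence. This is an illustration only; the 170-rerun figure above is an average over the study's data, not a result of this formula.

```python
import math


def reruns_needed(failure_probability, confidence=0.95):
    """Smallest n with (1 - p)^n <= 1 - confidence.

    Assumes independent runs with a fixed per-run failure probability p,
    which real flaky tests (e.g. order-dependent ones) often violate.
    """
    if not 0.0 < failure_probability < 1.0:
        raise ValueError("failure_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - failure_probability)
    )
```

The required number of reruns grows quickly as the failure probability shrinks, which is why a handful of reruns misses tests that fail only occasionally.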

SE · Feb 14, 2022
Gamekins: Gamifying Software Testing in Jenkins

Philipp Straubinger, Gordon Fraser

Developers have to write thorough tests for their software in order to find bugs and to prevent regressions. Writing tests, however, is not every developer's favourite occupation, and if a lack of motivation leads to a lack of tests, then this may have dire consequences, such as programs with poor quality or even project failures. This paper introduces Gamekins, a tool that uses gamification to motivate developers to write more and better tests. Gamekins is integrated into the Jenkins continuous integration platform where game elements are based on commits to the source code repository: Developers can earn points for completing test challenges and quests posed by Gamekins, compete with other developers or developer teams on a leaderboard, and are rewarded for their test-related achievements.

SE · Feb 13, 2022
Automated Test Generation for Scratch Programs

Adina Deiner, Patric Feldmeier, Gordon Fraser et al.

The importance of programming education has led to dedicated educational programming environments, where users visually arrange block-based programming constructs that typically control graphical, interactive game-like programs. The Scratch programming environment is particularly popular, with more than 70 million registered users at the time of this writing. While the block-based nature of Scratch helps learners by preventing syntactical mistakes, there nevertheless remains a need to provide feedback and support in order to implement desired functionality. To support individual learning and classroom settings, this feedback and support should ideally be provided in an automated fashion, which requires tests to enable dynamic program analysis. The Whisker framework enables automated testing of Scratch programs, but creating these automated tests for Scratch programs is challenging. In this paper, we therefore investigate how to automatically generate Whisker tests. This raises important challenges: First, game-like programs are typically randomised, leading to flaky tests. Second, Scratch programs usually consist of animations and interactions with long delays, inhibiting the application of classical test generation approaches. Evaluation on common programming exercises, a random sample of 1000 Scratch user programs, and the 1000 most popular Scratch programs demonstrates that our approach enables Whisker to reliably accelerate test executions, and even though many Scratch programs are small and easy to cover, there are many unique challenges for which advanced search-based test generation using many-objective algorithms is needed in order to achieve high coverage.

SE · Feb 13, 2022
Model-based Testing of Scratch Programs

Katharina Götz, Patric Feldmeier, Gordon Fraser

Learners are often introduced to programming via dedicated languages such as Scratch, where block-based commands are assembled visually in order to control the interactions of graphical sprites. Automated testing of such programs is an important prerequisite for supporting debugging, providing hints, or assessing learning outcomes. However, writing tests for Scratch programs can be challenging: The game-like and randomised nature of typical Scratch programs makes it difficult to identify specific timed input sequences used to control the programs. Furthermore, precise test assertions to check the resulting program states are incompatible with the fundamental principle of creative freedom in programming in Scratch, where correct program behaviour may be implemented with deviations in the graphical appearance or timing of the program. The event-driven and actor-oriented nature of Scratch programs, however, makes them a natural fit for describing program behaviour using finite state machines. In this paper, we introduce a model-based testing approach by extending Whisker, an automated testing framework for Scratch programs. The model-based extension describes expected program behaviour in terms of state machines, which makes it feasible to check the abstract behaviour of a program independent of exact timing and pixel-precise graphical details, and to automatically derive test inputs testing even challenging programs. A video demonstrating model-based testing with Whisker is available at the following URL: https://youtu.be/edgCNbGSGEY
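The idea of checking abstract behaviour against a state machine, rather than asserting exact timings or pixels, can be sketched as follows. The `Model` class, the state names, and the event names are hypothetical and do not reflect Whisker's model format.

```python
class Model:
    """A finite state machine over abstract program events."""

    def __init__(self, initial, transitions):
        self.initial = initial
        # Maps (state, event) to the next state.
        self.transitions = transitions

    def accepts(self, events):
        """Return True if the observed event sequence follows the model."""
        state = self.initial
        for event in events:
            key = (state, event)
            if key not in self.transitions:
                # The program did something the model does not allow here.
                return False
            state = self.transitions[key]
        return True


# Hypothetical model of a simple catching game.
GAME = Model(
    initial="idle",
    transitions={
        ("idle", "green_flag"): "playing",
        ("playing", "score"): "playing",
        ("playing", "touch_enemy"): "game_over",
    },
)
```

Since the model only constrains the order of events, two programs that score at different moments or draw different graphics can both be accepted.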

SE · Feb 10, 2022
Pynguin: Automated Unit Test Generation for Python

Stephan Lukasczyk, Gordon Fraser

Automated unit test generation is a well-known methodology aiming to reduce the developers' effort of writing tests manually. Prior research focused mainly on statically typed programming languages like Java. In practice, however, dynamically typed languages have received a huge gain in popularity over the last decade. This introduces the need for tools and research on test generation for these languages, too. We introduce Pynguin, an extendable test-generation framework for Python, which generates regression tests with high code coverage. Pynguin is designed to be easily usable by practitioners; it is also extensible to allow researchers to adapt it for their needs and to enable future research. We provide a demo of Pynguin at https://youtu.be/UiGrG25Vts0; further information, documentation, the tool, and its source code are available at https://www.pynguin.eu.

SE · Dec 1, 2021
Common Bugs in Scratch Programs

Christoph Frädrich, Florian Obermüller, Nina Körber et al.

Bugs in Scratch programs can spoil the fun and inhibit learning success. Many common bugs are the result of recurring patterns of bad code. In this paper we present a collection of common code patterns that typically hint at bugs in Scratch programs, and the LitterBox tool which can automatically detect them. We empirically evaluate how frequently these patterns occur, and how severe their consequences usually are. While fixing bugs inevitably is part of learning, the possibility to identify the bugs automatically provides the potential to support learners.

SE · Nov 9, 2021
An Empirical Study of Automated Unit Test Generation for Python

Stephan Lukasczyk, Florian Kroiß, Gordon Fraser

Various mature automated test generation tools exist for statically typed programming languages such as Java. Automatically generating unit tests for dynamically typed programming languages such as Python, however, is substantially more difficult due to the dynamic nature of these languages as well as the lack of type information. Our Pynguin framework provides automated unit test generation for Python. In this paper, we extend our previous work on Pynguin to support more aspects of the Python language, and study a larger variety of well-established state-of-the-art test-generation algorithms, namely DynaMOSA, MIO, and MOSA. Furthermore, we improved our Pynguin tool to generate regression assertions, whose quality we also evaluate. Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and similar to the Java world, DynaMOSA yields the highest coverage results. However, our results also demonstrate that there are still fundamental remaining issues, such as inferring type information for code without this information, currently limiting the effectiveness of test generation for Python.

CY · Nov 1, 2021
Challenging but Full of Opportunities: Teachers' Perspectives on Programming in Primary Schools

Luisa Greifenstein, Isabella Graßl, Gordon Fraser

The widespread establishment of computational thinking in school curricula requires teachers to introduce children to programming already at primary school level. As this is a recent development, primary school teachers may neither be adequately prepared for how to best teach programming, nor may they be fully aware why they have to do so. In order to gain a better understanding of these questions, we contrast insights taken from practical experiences with the anticipations of teachers in training. By surveying 200 teachers who have taught programming at primary schools and 97 teachers in training, we identify relevant challenges when teaching programming, opportunities that arise when children learn programming, and strategies for addressing both of these in practice. While many challenges and opportunities are correctly anticipated, we find several disagreements that can inform revisions of the curricula in teaching studies to better prepare primary school teachers for teaching programming at primary schools.

SE · Aug 16, 2021
Improving Readability of Scratch Programs with Search-based Refactoring

Felix Adler, Gordon Fraser, Eva Gründinger et al.

Block-based programming languages like Scratch have become increasingly popular as introductory languages for novices. These languages are intended to be used with a "tinkering" approach which allows learners and teachers to quickly assemble working programs and games, but this often leads to low code quality. Such code can be hard to comprehend, changing it is error-prone, and learners may struggle and lose interest. The general solution to improve code quality is to refactor the code. However, Scratch lacks many of the common abstraction mechanisms used when refactoring programs written in higher programming languages. In order to improve Scratch code, we therefore propose a set of atomic code transformations to optimise readability by (1) rewriting control structures and (2) simplifying scripts using the inherently concurrent nature of Scratch programs. By automating these transformations it is possible to explore the space of possible variations of Scratch programs. In this paper, we describe a multi-objective search-based approach that determines sequences of code transformations which improve the readability of a given Scratch program and therefore form refactorings. Evaluation on a random sample of 1000 Scratch programs demonstrates that the generated refactorings reduce complexity and entropy in 70.4% of the cases, and 354 projects are improved in at least one metric without making any other metric worse. The refactored programs can help both novices and their teachers to improve their code.
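The criterion "improved in at least one metric without making any other metric worse" is Pareto dominance, which multi-objective search uses to compare candidates. A minimal sketch, assuming every metric is to be minimised and that each program variant is summarised by a tuple of metric values (the example metrics are hypothetical):

```python
def dominates(a, b):
    """True if a is no worse than b in every metric and better in one.

    Both arguments are equal-length tuples of metrics where lower is better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

A refactoring sequence is worth reporting when the metrics of the transformed program dominate those of the original; candidates that trade one metric against another stay on the front together.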

SE · Aug 16, 2021
Data-driven Analysis of Gender Differences and Similarities in Scratch Programs

Isabella Graßl, Katharina Geldreich, Gordon Fraser

Block-based programming environments such as Scratch are an essential entry point to computer science. In order to create an effective learning environment that has the potential to address the gender imbalance in computer science, it is essential to better understand gender-specific differences in how children use such programming environments. In this paper, we explore gender differences and similarities in Scratch programs along two dimensions: In order to understand what motivates girls and boys to use Scratch, we apply a topic analysis using unsupervised machine learning for the first time on Scratch programs, using a dataset of 317 programs created by girls and boys in the range of 8-10 years. In order to understand how they program for these topics, we apply automated program analysis on the code implemented in these projects. We find that, in-line with common stereotypes, girls prefer topics that revolve around unicorns, celebrating, dancing and music, while boys tend to prefer gloomy topics with bats and ghouls, or competitive ones such as soccer or basketball. Girls prefer animations and stories, resulting in simpler control structures, while boys create games with more loops and conditional statements, resulting in more complex programs. Considering these differences can help to improve the learning outcomes and the resulting computing-related self-concepts, which are prerequisites for developing a longer-term interest in computer science.

SE · Aug 16, 2021
Effects of Hints on Debugging Scratch Programs: An Empirical Study with Primary School Teachers in Training

Luisa Greifenstein, Florian Obermüller, Ewald Wasmeier et al.

Bugs in learners' programs are often the result of fundamental misconceptions. Teachers frequently face the challenge of first having to understand such bugs, and then suggest ways to fix them. In order to enable teachers to do so effectively and efficiently, it is desirable to support them in recognising and fixing bugs. Misconceptions often lead to recurring patterns of similar bugs, enabling automated tools to provide this support in terms of hints on occurrences of common bug patterns. In this paper, we investigate to what extent the hints improve the effectiveness and efficiency of teachers in debugging learners' programs using a cohort of 163 primary school teachers in training, tasked to correct buggy Scratch programs, with and without hints on bug patterns. Our experiment suggests that automatically generated hints can reduce the effort of finding and fixing bugs from 8.66 to 5.24 minutes, while increasing the effectiveness by 34% more correct solutions. While this improvement is convincing, arguably teachers in training might first need to learn debugging "the hard way" to not miss the opportunity to learn by relying on tools. We therefore investigate whether the use of hints during training affects their ability to recognise and fix bugs without hints. Our experiment provides no significant evidence that either learning to debug with hints or learning to debug "the hard way" leads to better learning effects. Overall, this suggests that bug patterns might be a useful concept to include in the curriculum for teachers in training, while tool-support to recognise these patterns is desirable for teachers in practice.

SE · Aug 13, 2021
Code Perfumes: Reporting Good Code to Encourage Learners

Florian Obermüller, Lena Bloch, Luisa Greifenstein et al.

Block-based programming languages like Scratch enable children to be creative while learning to program. Even though the block-based approach simplifies the creation of programs, learning to program can nevertheless be challenging. Automated tools such as linters therefore support learners by providing feedback about potential bugs or code smells in their programs. Even when this feedback is elaborate and constructive, it still represents purely negative criticism and by construction ignores what learners have done correctly in their programs. In this paper we introduce an orthogonal approach to linting: We complement the criticism produced by a linter with positive feedback. We introduce the concept of code perfumes as the counterpart to code smells, indicating the correct application of programming practices considered to be good. By analysing not only what learners did wrong but also what they did right we hope to encourage learners, to provide teachers and students a better understanding of learners' progress, and to support the adoption of automated feedback tools. Using a catalogue of 25 code perfumes for Scratch, we empirically demonstrate that these represent frequent practices in Scratch, and we find that better programs indeed contain more code perfumes.

SE · May 12, 2021
Guiding Next-Step Hint Generation Using Automated Tests

Florian Obermüller, Ute Heuer, Gordon Fraser

Learning basic programming with Scratch can be hard for novices and tutors alike: Students may not know how to advance when solving a task, teachers may face classrooms with many raised hands at a time, and the problem is exacerbated when novices are on their own in online or virtual lessons. It is therefore desirable to generate next-step hints automatically to provide individual feedback for students who are stuck, but current approaches rely on the availability of multiple hand-crafted or hand-selected sample solutions from which to draw valid hints, and have not been adapted for Scratch. Automated testing provides an opportunity to automatically select suitable candidate solutions for hint generation, even from a pool of student solutions using different solution approaches and varying in quality. In this paper we present Catnip, the first next-step hint generation approach for Scratch, which extends existing data-driven hint generation approaches with automated testing. Evaluation of Catnip on a dataset of student Scratch programs demonstrates that the generated hints point towards functional improvements, and the use of automated tests allows the hints to be better individualized for the chosen solution path.

SE · Apr 23, 2021
SnapCheck: Automated Testing for Snap Programs

Wengran Wang, Chenhao Zhang, Andreas Stahlbauer et al.

Programming environments such as Snap, Scratch, and Processing engage learners by allowing them to create programming artifacts such as apps and games, with visual and interactive output. Learning programming with such a media-focused context has been shown to increase retention and success rate. However, assessing these visual, interactive projects requires time and laborious manual effort, and it is therefore difficult to offer automated or real-time feedback to students as they work. In this paper, we introduce SnapCheck, a dynamic testing framework for Snap that enables instructors to author test cases with Condition-Action templates. The goal of SnapCheck is to allow instructors or researchers to author property-based test cases that can automatically assess students' interactive programs with high accuracy. Our evaluation of SnapCheck on 162 code snapshots from a Pong game assignment in an introductory programming course shows that our automated testing framework achieves at least 98% accuracy over all rubric items, showing the potential of SnapCheck for auto-grading and for providing formative feedback to students.

SE · Mar 12, 2021
Does mutation testing improve testing practices?

Goran Petrović, Marko Ivanković, Gordon Fraser et al.

Various proxy metrics for test quality have been defined in order to guide developers when writing tests. Code coverage is particularly well established in practice, even though the question of how coverage relates to test quality is a matter of ongoing debate. Mutation testing offers a promising alternative: Artificial defects can identify holes in a test suite, and thus provide concrete suggestions for additional tests. Despite the obvious advantages of mutation testing, it is not yet well established in practice. Until recently, mutation testing tools and techniques simply did not scale to complex systems. Although they now do scale, a remaining obstacle is lack of evidence that writing tests for mutants actually improves test quality. In this paper we aim to fill this gap: By analyzing a large dataset of almost 15 million mutants, we investigate how these mutants influenced developers over time, and how these mutants relate to real faults. Our analyses suggest that developers using mutation testing write more tests, and actively improve their test suites with high quality tests such that fewer mutants remain. By analyzing a dataset of past fixes of real high-priority faults, our analyses further provide evidence that mutants are indeed coupled with real faults. In other words, had mutation testing been used for the changes introducing the faults, it would have reported a live mutant that could have prevented the bug.
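How a mutant exposes a hole in a test suite can be shown with a toy example. The function, the mutant, and the two test suites below are invented for illustration and are unrelated to the dataset studied in the paper.

```python
def is_adult(age):
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the relational operator >= has been replaced by >."""
    return age > 18


def weak_suite(func):
    """Checks only values far from the boundary; returns True if all pass."""
    return func(30) is True and func(5) is False


def strong_suite(func):
    """Adds the boundary case that the mutant points to."""
    return weak_suite(func) and func(18) is True
```

The weak suite passes on both versions, so the mutant survives and signals a missing test. Adding the boundary case makes the suite fail on the mutant, which is then said to be killed.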

SE · Feb 22, 2021
Practical Mutation Testing at Scale

Goran Petrović, Marko Ivanković, Gordon Fraser et al.

Mutation analysis assesses a test suite's adequacy by measuring its ability to detect small artificial faults, systematically seeded into the tested program. Mutation analysis is considered one of the strongest test-adequacy criteria. Mutation testing builds on top of mutation analysis and is a testing technique that uses mutants as test goals to create or improve a test suite. Mutation testing has long been considered intractable because the sheer number of mutants that can be created represents an insurmountable problem -- both in terms of human and computational effort. This has hindered the adoption of mutation testing as an industry standard. For example, Google has a codebase of two billion lines of code and more than 500,000,000 tests are executed on a daily basis. The traditional approach to mutation testing does not scale to such an environment. To address these challenges, this paper presents a scalable approach to mutation testing based on the following main ideas: (1) Mutation testing is done incrementally, mutating only changed code during code review, rather than the entire code base; (2) Mutants are filtered, removing mutants that are likely to be irrelevant to developers, and limiting the number of mutants per line and per code review process; (3) Mutants are selected based on the historical performance of mutation operators, further eliminating irrelevant mutants and improving mutant quality. Evaluation in a code-review-based setting with more than 24,000 developers on more than 1,000 projects shows that the proposed approach produces orders of magnitude fewer mutants and that context-based mutant filtering and selection improve mutant quality and actionability. Overall, the proposed approach represents a mutation testing framework that seamlessly integrates into the software development workflow and is applicable up to large-scale industrial settings.
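The three ideas listed above (mutate only changed lines, cap the number of mutants, prefer historically useful operators) can be combined in a small selection routine. This is a sketch under assumptions: the mutant records, the `score` field standing in for an operator's historical performance, and the default limits are all hypothetical and not taken from the paper.

```python
from collections import Counter


def select_mutants(candidates, changed_lines, per_line=1, per_review=3):
    """Pick mutants to show in a code review.

    `candidates` is a list of dicts with hypothetical keys "line",
    "operator" and "score" (higher means the operator was more useful
    in the past). Only mutants on changed lines are considered.
    """
    chosen = []
    used = Counter()
    # Visit the most promising mutants first.
    for mutant in sorted(candidates, key=lambda m: m["score"], reverse=True):
        line = mutant["line"]
        if line not in changed_lines:
            continue  # incremental: ignore code the change did not touch
        if used[line] >= per_line:
            continue  # cap on mutants per line
        chosen.append(mutant)
        used[line] += 1
        if len(chosen) >= per_review:
            break  # cap on mutants per review
    return chosen
```

Restricting candidates to the changed lines is what keeps the number of mutants proportional to the size of a change instead of the size of the code base.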

SE · Feb 15, 2021
Finding Anomalies in Scratch Assignments

Nina Körber, Katharina Geldreich, Andreas Stahlbauer et al.

In programming education, teachers need to monitor and assess the progress of their students by investigating the code they write. Code quality of programs written in traditional programming languages can be automatically assessed with automated tests, verification tools, or linters. In many cases these approaches rely on some form of manually written formal specification to analyze the given programs. Writing such specifications, however, is hard for teachers, who are often not adequately trained for this task. Furthermore, automated tool support for popular block-based introductory programming languages like Scratch is lacking. Anomaly detection is an approach to automatically identify deviations of common behavior in datasets without any need for writing a specification. In this paper, we use anomaly detection to automatically find deviations of Scratch code in a classroom setting, where anomalies can represent erroneous code, alternative solutions, or distinguished work. Evaluation on solutions of different programming tasks demonstrates that anomaly detection can successfully be applied to tightly specified as well as open-ended programming tasks.

CYFeb 12, 2021
Gradeer: An Open-Source Modular Hybrid Grader

Benjamin Clegg, Maria-Cruz Villa-Uriol, Phil McMinn et al.

Automated assessment has been shown to greatly simplify the process of assessing students' programs. However, manual assessment still offers benefits to both students and tutors. We introduce Gradeer, a hybrid assessment tool, which allows tutors to leverage the advantages of both automated and manual assessment. The tool features a modular design, allowing new grading functionality to be added. Gradeer directly assists manual grading by automatically loading code inspectors, running students' programs, and allowing grading to be stopped and resumed in place at a later time. We used Gradeer to assess an end-of-year assignment for an introductory Java programming course and found that its hybrid approach offers several benefits.

SESep 9, 2020
Search-based Testing for Scratch Programs

Adina Deiner, Christoph Frädrich, Gordon Fraser et al.

Block-based programming languages enable young learners to quickly implement fun programs and games. The Scratch programming environment is particularly successful at this, with more than 50 million registered users at the time of this writing. Although Scratch simplifies creating syntactically correct programs, learners and educators nevertheless frequently require feedback and support. Dynamic program analysis could enable automation of this support, but the test suites necessary for dynamic analysis do not usually exist for Scratch programs. It is, however, possible to cast test generation for Scratch as a search problem. In this paper, we introduce an approach for automatically generating test suites for Scratch programs using grammatical evolution. The use of grammatical evolution clearly separates the search encoding from framework-specific implementation details, and allows us to use advanced test acceleration techniques. We implemented our approach as an extension of the Whisker test framework. Evaluation on sample Scratch programs demonstrates the potential of the approach.

CYAug 28, 2020
An Experience of Introducing Primary School Children to Programming using Ozobots (Practical Report)

Nina Körber, Lisa Bailey, Luisa Greifenstein et al.

Algorithmic thinking is a central concept in the context of computational thinking, and it is commonly taught by computer programming. A recent trend is to introduce basic programming concepts very early on, at primary school level. There are, however, several challenges in teaching programming at this level: Schools and teachers are often neither equipped nor trained appropriately, and the best way to move from initial "unplugged" activities to creating programs on a computer is still a matter of open debate. In this paper, we describe our experience of a small INTERREG-project aiming at supporting local primary schools in introducing children to programming concepts using Ozobot robots. These robots have two distinct advantages: First, they can be programmed with and without computers, thus helping the transition from unplugged programming to programming with a computer. Second, they are small and easy to transport, even when used together with tablet computers. Although we learned in our outreach events that the use of Ozobots is not without challenges, our overall experience is positive and can hopefully support others in setting up first encounters with programming at primary schools.

SEJul 28, 2020
Automated Unit Test Generation for Python

Stephan Lukasczyk, Florian Kroiß, Gordon Fraser

Automated unit test generation is an established research field, and mature test generation tools exist for statically typed programming languages such as Java. It is, however, substantially more difficult to automatically generate supportive tests for dynamically typed programming languages such as Python, due to the lack of type information and the dynamic nature of the language. In this paper, we describe a foray into the problem of unit test generation for dynamically typed languages. We introduce Pynguin, an automated unit test generation framework for Python. Using Pynguin, we aim to empirically shed light on two central questions: (1) Do well-established search-based test generation methods, previously evaluated only on statically typed languages, generalise to dynamically typed languages? (2) What is the influence of incomplete type information and dynamic typing on the problem of automated test generation? Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and can even alleviate the problem of absent type information to some degree. However, our results demonstrate that dynamic typing nevertheless poses a fundamental issue for test generation, suggesting future work on integrating type inference.

SEDec 14, 2019
IMPRESS: Improving Engagement in Software Engineering Courses through Gamification

Tanja E. J. Vos, I. S. W. B. Prasetya, Gordon Fraser et al.

Software Engineering courses play an important role in preparing students with the right knowledge and attitude for software development in practice. The implication is far-reaching, as the quality of the software that we use ultimately depends on the quality of the people who make it. Teaching Software Engineering, however, is quite challenging, as students do not consider the subject to be among the most exciting, while teachers often have to deal with rapidly growing numbers of students. The EU project IMPRESS seeks to explore the use of gamification in teaching software engineering at the university level to improve students' engagement and hence their appreciation for the taught subjects. This paper presents the project, its objectives, and its current progress.

SEAug 10, 2016
Uncertainty-Driven Black-Box Test Data Generation

Neil Walkinshaw, Gordon Fraser

We can never be certain that a software system is correct simply by testing it, but with every additional successful test we become less uncertain about its correctness. In the absence of source code or elaborate specifications and models, tests are usually generated or chosen randomly. However, rather than randomly choosing tests, it would be preferable to choose those tests that decrease our uncertainty about correctness the most. In order to guide test generation, we apply what is referred to in Machine Learning as a "Query Strategy Framework": We infer a behavioural model of the system under test and select those tests which the inferred model is "least certain" about. Running these tests on the system under test thus directly targets those parts about which tests so far have failed to inform the model. We provide an implementation that uses a genetic programming engine for model inference in order to enable an uncertainty sampling technique known as "query by committee", and evaluate it on eight subject systems from the Apache Commons Math framework and JodaTime. The results indicate that test generation using uncertainty sampling outperforms conventional and Adaptive Random Testing.
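
The selection step of "query by committee" mentioned in this abstract can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: the committee consists of three hand-written hypothetical models (the paper infers models with genetic programming), and disagreement is measured as one minus the share of the majority vote.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Model = Callable[[int], Hashable]


def disagreement(committee: Sequence[Model], x: int) -> float:
    """Return 0.0 if all models agree on x; higher values mean more disagreement."""
    votes = Counter(model(x) for model in committee)
    majority_size = votes.most_common(1)[0][1]
    return 1.0 - majority_size / len(committee)


def select_most_uncertain(committee: Sequence[Model],
                          candidates: Sequence[int]) -> int:
    """Pick the candidate test input the committee disagrees about the most."""
    return max(candidates, key=lambda x: disagreement(committee, x))


# Hypothetical committee: three competing guesses at the system's behaviour,
# each assuming a different threshold.
COMMITTEE: Sequence[Model] = [
    lambda x: x > 0,
    lambda x: x > 5,
    lambda x: x > 10,
]

if __name__ == "__main__":
    # All models agree on -3 and 20; they split on 7, so 7 is the most
    # informative input to execute next on the system under test.
    print(select_most_uncertain(COMMITTEE, [-3, 7, 20]))
```

In a full loop, the selected input would be run on the system under test, and the observed output would be used to re-infer the committee before the next selection.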

SEJul 20, 2014
Inferring Loop Invariants by Mutation, Dynamic Analysis, and Static Checking

Juan P. Galeotti, Carlo A. Furia, Eva May et al.

Verifiers that can prove programs correct against their full functional specification require, for programs with loops, additional annotations in the form of loop invariants: properties that hold for every iteration of a loop. We show that significant loop invariant candidates can be generated by systematically mutating postconditions; then, dynamic checking (based on automatically generated tests) weeds out invalid candidates, and static checking selects provably valid ones. We present a framework that automatically applies these techniques to support a program prover, paving the way for fully automatic verification without manually written loop invariants: Applied to 28 methods (including 39 different loops) from various java.util classes (occasionally modified to avoid using Java features not fully supported by the static checker), our DYNAMATE prototype automatically discharged 97% of all proof obligations, resulting in automatic complete correctness proofs of 25 out of the 28 methods, outperforming several state-of-the-art tools for fully automatic verification.

SEMar 12, 2013
Using State Infection Conditions to Detect Equivalent Mutants and Speed up Mutation Analysis

René Just, Michael D. Ernst, Gordon Fraser

Mutation analysis evaluates test suites and testing techniques by measuring how well they detect seeded defects (mutants). Even though well established in research, mutation analysis is rarely used in practice due to scalability problems: there are multiple mutations per code statement leading to a large number of mutants, and hence executions of the test suite. In addition, the use of mutation to improve test suites is futile for mutants that are equivalent, which means that there exists no test case that distinguishes them from the original program. This paper introduces two optimizations based on state infection conditions, i.e., conditions that determine for a test execution whether the same execution on a mutant would lead to a different state. First, redundant test execution can be avoided by monitoring state infection conditions, leading to an overall performance improvement. Second, state infection conditions can aid in identifying equivalent mutants, thus guiding efforts to improve test suites.