Richard Torkar

24papers

775citations

Novelty23%

AI Score42

Ranked #83,538 of 201,326 authors (top 41%)#998 in SE (top 29%)

24 Papers

SEMar 19

Mitigating Omitted Variable Bias in Empirical Software Engineering

Carlo A. Furia, Richard Torkar

Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. This results in the model attributing the missing variables' effect to some of the included variables -- hence over- or under-estimating the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias on two illustrative examples in the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design, and to a significant reduction in its threats to validity.

SEJun 22, 2019Code

The Connection Between Burnout and Personality Types in Software Developers

Emanuel Mellblom, Isar Arason, Lucas Gren et al.

This paper examines the connection between the Five Factor Model personality traits and burnout in software developers. This study aims to validate generalizations of findings in other fields. An online survey consisting of a miniaturized International Personality Item Pool questionnaire for measuring the Five Factor Model personality traits, and the Shirom-Melamed Burnout Measure for measuring burnout, were distributed to open source developer mailing lists, obtaining 47 valid responses. The results from a Bayesian Linear Regression analysis indicate a strong link between neuroticism and burnout confirming previous work, while the other Five Factor Model traits were not adding power to the model. It is important to note that we did not investigate the quality of work in connection to personality, nor did we take any other confounding factors into account like, for example, teamwork. Nonetheless, employers could be aware of, and support, software developers with high neuroticism.

SEApr 9

Towards Improving the External Validity of Software Engineering Experiments with Transportability Methods

Julian Frattini, Richard Torkar, Robert Feldt et al.

Controlled experiments are a core research method in software engineering (SE) for validating causal claims. However, recruiting a sample of participants that represents the intended target population is often difficult or expensive, which limits the external validity of experimental results. At the same time, SE researchers often have access to much larger amounts of observational than experimental data (e.g., from repositories, issue trackers, logs, surveys and industrial processes). Transportability methods combine these data from experimental and observational studies to "transport" results from the experimental sample to a broader, more representative sample of the target population. Although the ability to combine observational and experimental data in a principled way could substantially benefit empirical SE research, transportability methods have - to our knowledge - not been adopted in SE. In this vision, we aim to help make that adoption possible. To that end, we introduce transportability methods, their prerequisites, and demonstrate their potential through a simulation. We then outline several SE research scenarios in which these methods could apply, e.g., how to effectively use students as substitutes for developers. Finally, we outline a road map and practical guidelines to support SE researchers in applying them. Adopting transportability methods in SE research can strengthen the external validity of controlled experiments and help the field produce results that are both more reliable and more useful in practice.

LGFeb 2

Causality--Δ: Jacobian-Based Dependency Analysis in Flow Matching Models

Reza Rezvan, Gustav Gille, Moritz Schauer et al.

Flow matching learns a velocity field that transports a base distribution to data. We study how small latent perturbations propagate through these flows and show that Jacobian-vector products (JVPs) provide a practical lens on dependency structure in the generated features. We derive closed-form expressions for the optimal drift and its Jacobian in Gaussian and mixture-of-Gaussian settings, revealing that even globally nonlinear flows admit local affine structure. In low-dimensional synthetic benchmarks, numerical JVPs recover the analytical Jacobians. In image domains, composing the flow with an attribute classifier yields an attribute-level JVP estimator that recovers empirical correlations on MNIST and CelebA. Conditioning on small classifier-Jacobian norms reduces correlations in a way consistent with a hypothesized common-cause structure, while we emphasize that this conditioning is not a formal do intervention.

SESep 11, 2021

Take a deep breath. Benefits of neuroplasticity practices for software developers and computer workers in a family of experiments

Birgit Penzenstadler, Richard Torkar, Cristina Martinez Montes

Context. Computer workers in general, and software developers specifically, are under a high amount of stress due to continuous deadlines and, often, over-commitment. Objective. This study investigates the effects of a neuroplasticity practice, a specific breathing practice, on the attention awareness, well-being, perceived productivity, and self-efficacy of computer workers. Method. We created a questionnaire mainly from existing, validated scales as entry and exit survey for data points for comparison before and after the intervention. The intervention was a 12-week program with a weekly live session that included a talk on a well-being topic and a facilitated group breathing session. During the intervention period, we solicited one daily journal note and one weekly well-being rating. We replicated the intervention in a similarly structured 8-week program. The data was analyzed using a Bayesian multi-level model for the quantitative part and thematic analysis for the qualitative part. Results. The intervention showed improvements in participants' experienced inner states despite an ongoing pandemic and intense outer circumstances for most. Over the course of the study, we found an improvement in the participants' ratings of how often they found themselves in good spirits as well as in a calm and relaxed state. We also aggregate a large number of deep inner reflections and growth processes that may not have surfaced for the participants without deliberate engagement in such a program. Conclusion. The data indicates usefulness and effectiveness of an intervention for computer workers in terms of increasing well-being and resilience. Everyone needs a way to deliberately relax, unplug, and recover. Breathing practice is a simple way to do so, and the results call for establishing a larger body of work to make this common practice.

SEApr 13, 2021

Not All Requirements Prioritization Criteria Are Equal at All Times: A Quantitative Analysis

Richard Berntsson Svensson, Richard Torkar

Requirement prioritization is recognized as an important decision-making activity in requirements engineering and software development. Requirement prioritization is applied to determine which requirements should be implemented and released. In order to prioritize requirements, there are several approaches/techniques/tools that use different requirements prioritization criteria, which are often identified by gut feeling instead of an in-depth analysis of which criteria are most important to use. Therefore, in this study we investigate which requirements prioritization criteria are most important to use in industry when determining which requirements are implemented and released, and if the importance of the criteria change depending on how far a requirement has reached in the development process. We conducted a quantitative study of one completed project from one software developing company by extracting 32,139 requirements prioritization decisions based on eight requirements prioritization criteria for 11,110 requirements. The results show that not all requirements prioritization criteria are equally important, and this change depending on how far a requirement has reached in the development process.

SEJan 29, 2021

Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality

Carlo A. Furia, Richard Torkar, Robert Feldt

Statistical analysis is the tool of choice to turn data into information, and then information into empirical knowledge. To be valid, the process that goes from data to knowledge should be supported by detailed, rigorous guidelines, which help ferret out issues with the data or model, and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is apt to empirical research in software engineering. To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset's original analysis (Ray et al., 2014) and a critical reanalysis (Berger at al., 2019) have attracted considerable attention -- in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase some of its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality. The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state of the art while highlighting the boundaries of its validity. The guidelines can support building solid statistical analyses and connecting their results, and hence help buttress continued progress in empirical software engineering research.

SESep 22, 2020

Measuring affective states from technical debt: A psychoempirical software engineering experiment

Jesper Olsson, Erik Risfelt, Terese Besker et al.

Software engineering is a human activity. Despite this, human aspects are under-represented in technical debt research, perhaps because they are challenging to evaluate. This study's objective was to investigate the relationship between technical debt and affective states (feelings, emotions, and moods) from software practitioners. Forty participants (N = 40) from twelve companies took part in a mixed-methods approach, consisting of a repeated-measures (r = 5) experiment (n = 200), a survey, and semi-structured interviews. The statistical analysis shows that different design smells (strong indicators of technical debt) negatively or positively impact affective states. From the qualitative data, it is clear that technical debt activates a substantial portion of the emotional spectrum and is psychologically taxing. Further, the practitioners' reactions to technical debt appear to fall in different levels of maturity. We argue that human aspects in technical debt are important factors to consider, as they may result in, e.g., procrastination, apprehension, and burnout.

SEJul 18, 2020

An empirical study of Linespots: A novel past-fault algorithm

Maximilian Scholz, Richard Torkar

This paper proposes the novel past-faults fault prediction algorithm Linespots, based on the Bugspots algorithm. We analyze the predictive performance and runtime of Linespots compared to Bugspots with an empirical study using the most significant self-built dataset as of now, including high-quality samples for validation. As a novelty in fault prediction, we use Bayesian data analysis and Directed Acyclic Graphs to model the effects. We found consistent improvements in the predictive performance of Linespots over Bugspots for all seven evaluation metrics. We conclude that Linespots should be used over Bugspots in all cases where no real-time performance is necessary.

SEMay 3, 2020

Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Paul Ralph, Sebastian Baltes, Gianisa Adisaputri et al.

Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.

SEJul 8, 2019

Estimating Return on Investment for GUI Test Automation Tools

Felix Dobslaw, Robert Feldt, David Michaelsson et al.

Automated graphical user interface (GUI) tests can reduce manual testing activities and increase test frequency. This motivates the conversion of manual test cases into automated GUI tests. However, it is not clear whether such automation is cost-effective given that GUI automation scripts add to the code base and demand maintenance as a system evolves. In this paper, we introduce a method for estimating maintenance cost and Return on Investment (ROI) for Automated GUI Testing (AGT). The method utilizes the existing source code change history and can be used for evaluation also of other testing or quality assurance automation technologies. We evaluate the method for a real-world, industrial software system and compare two fundamentally different AGT tools, namely Selenium and EyeAutomate, to estimate and compare their ROI. We also report on their defect-finding capabilities and usability. The quantitative data is complemented by interviews with employees at the case company. The method was successfully applied and estimated maintenance cost and ROI for both tools are reported. Overall, the study supports earlier results showing that implementation time is the leading cost for introducing AGT. The findings further suggest that while EyeAutomate tests are significantly faster to implement, Selenium tests require more of a programming background but less maintenance.

SEApr 8, 2019

The Unfulfilled Potential of Data-Driven Decision Making in Agile Software Development

Richard Berntsson Svensson, Robert Feldt, Richard Torkar

With the general trend towards data-driven decision making (DDDM), organizations are looking for ways to use DDDM to improve their decisions. However, few studies have looked into the practitioners view of DDDM, in particular for agile organizations. In this paper we investigated the experiences of using DDDM, and how data can improve decision making. An emailed questionnaire was sent out to 124 industry practitioners in agile software developing companies, of which 84 answered. The results show that few practitioners indicated a widespread use of DDDM in their current decision making practices. The practitioners were more positive to its future use for higher-level and more general decision making, fairly positive to its use for requirements elicitation and prioritization decisions, while being less positive to its future use at the team level. The practitioners do see a lot of potential for DDDM in an agile context; however, currently unfulfilled.

SEApr 4, 2019

Group development and group maturity when building agile teams: A qualitative and quantitative investigation at eight large companies

Lucas Gren, Richard Torkar, Robert Feldt

The agile approach to projects focuses more on close-knit teams than traditional waterfall projects, which means that aspects of group maturity become even more important. This psychological aspect is not much researched in connection to the building of an "agile team." The purpose of this study is to investigate how building agile teams is connected to a group development model taken from social psychology. We conducted ten semi-structured interviews with coaches, Scrum Masters, and managers responsible for the agile process from seven different companies, and collected survey data from 66 group-members from four companies (a total of eight different companies). The survey included an agile measurement tool and the one part of the Group Development Questionnaire. The results show that the practitioners define group developmental aspects as key factors to a successful agile transition. Also, the quantitative measurement of agility was significantly correlated to the group maturity measurement. We conclude that adding these psychological aspects to the description of the "agile team" could increase the understanding of agility and partly help define an "agile team." We propose that future work should develop specific guidelines for how software development teams at different maturity levels might adopt agile principles and practices differently.

SEApr 4, 2019

Group Maturity and Agility, Are They Connected? - A Survey Study

Lucas Gren, Richard Torkar, Robert Feldt

The focus on psychology has increased within software engineering due to the project management innovation "agile development processes". The agile methods do not explicitly consider group development aspects; they simply assume what is described in group psychology as mature groups. This study was conducted with 45 employees and their twelve managers (N=57) from two SAP customers in the US that were working with agile methods, and the data were collected via an online survey. The selected Agility measurement was correlated to a Group Development measurement and showed significant convergent validity, i.e., a more mature team is also a more agile team. This means that the agile methods probably would benefit from taking group development into account when its practices are being introduced.

SEApr 4, 2019

The prospects of a quantitative measurement of agility: A validation study on an agile maturity model

Lucas Gren, Richard Torkar, Robert Feldt

Agile development has now become a well-known approach to collaboration in professional work life. Both researchers and practitioners want validated tools to measure agility. This study sets out to validate an agile maturity measurement model with statistical tests and empirical data. First, a pretest was conducted as a case study including a survey and focus group. Second, the main study was conducted with 45 employees from two SAP customers in the US. We used internal consistency (by a Cronbach's alpha) as the main measure for reliability and analyzed construct validity by exploratory principal factor analysis (PFA). The results suggest a new categorization of a subset of items existing in the tool and provides empirical support for these new groups of factors. However, we argue that more work is needed to reach the point where a maturity models with quantitative data can be said to validly measure agility, and even then, such a measurement still needs to include some deeper analysis with cultural and contextual items.

SEApr 4, 2019

Work Motivational Challenges Regarding the Interface Between Agile Teams and a Non-Agile Surrounding Organization: A case study

Lucas Gren, Richard Torkar, Robert Feldt

There are studies showing what happens if agile teams are introduced into a non-agile organization, e.g. higher overhead costs and the necessity of an understanding of agile methods even outside the teams. This case study shows an example of work motivational aspects that might surface when an agile team exists in the middle of a more traditional structure. This case study was conducted at a car manufacturer in Sweden, consisting of an unstructured interview with the Scrum Master and a semi-structured focus group. The results show that the teams felt that the feedback from the surrounding organization was unsynchronized resulting in them not feeling appreciated when delivering their work. Moreover, they felt frustrated when working on non-agile teams after have been working on agile ones. This study concludes that there were work motivational affects of fitting an agile team into a non-agile surrounding organization, and therefore this might also be true for other organizations.

SEApr 1, 2019

Bayesian data analysis in empirical software engineering---The case of missing data

Richard Torkar, Robert Feldt, Carlo A. Furia

Bayesian data analysis (BDA) is today used by a multitude of research disciplines. These disciplines use BDA as a way to embrace uncertainty by using multilevel models and making use of all available information at hand. In this chapter, we first introduce the reader to BDA and then provide an example from empirical software engineering, where we also deal with a common issue in our field, i.e., missing data. The example we make use of presents the steps done when conducting state of the art statistical analysis. First, we need to understand the problem we want to solve. Second, we conduct causal analysis. Third, we analyze non-identifiability. Fourth, we conduct missing data analysis. Finally, we do a sensitivity analysis of priors. All this before we design our statistical model. Once we have a model, we present several diagnostics one can use to conduct sanity checks. We hope that through these examples, the reader will see the advantages of using BDA. This way, we hope Bayesian statistics will become more prevalent in our field, thus partly avoiding the reproducibility crisis we have seen in other disciplines.

SENov 13, 2018

Bayesian Data Analysis in Empirical Software Engineering Research

Carlo A. Furia, Robert Feldt, Richard Torkar

Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings---such as lack of flexibility and results that are unintuitive and hard to interpret---that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings, and present Bayesian data analysis techniques that provide tangible benefits---as they can provide clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, we present the reanalysis of two empirical studies on the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analyses with our new Bayesian analyses, we demonstrate the concrete advantages of the latter. To conclude we advocate a more prominent role for Bayesian statistical techniques in empirical software engineering research and practice.

SESep 26, 2018

A Method to Assess and Argue for Practical Significance in Software Engineering

Richard Torkar, Carlo A. Furia, Robert Feldt et al.

A key goal of empirical research in software engineering is to assess practical significance, which answers whether the observed effects of some compared treatments show a relevant difference in practice in realistic scenarios. Even though plenty of standard techniques exist to assess statistical significance, connecting it to practical significance is not straightforward or routinely done; indeed, only a few empirical studies in software engineering assess practical significance in a principled and systematic way. In this paper, we argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously. We demonstrate our claims in a case study comparing different test techniques. The case study's data was previously analyzed (Afzal et al., 2015) using standard techniques focusing on statistical significance. Here, we build a multilevel model of the same data, which we fit and validate using Bayesian techniques. Our method is to apply cumulative prospect theory on top of the statistical model to quantitatively connect our statistical analysis output to a practically meaningful context. This is then the basis both for assessing and arguing for practical significance. Our study demonstrates that Bayesian analysis provides a technically rigorous yet practical framework for empirical software engineering. A substantial side effect is that any uncertainty in the underlying data will be propagated through the statistical model, and its effects on practical significance are made clear. Thus, in combination with cumulative prospect theory, Bayesian analysis supports seamlessly assessing practical significance in an empirical software engineering context, thus potentially clarifying and extending the relevance of research for practitioners.

SEApr 24, 2018

Transferring Interactive Search-Based Software Testing to Industry

Bogdan Marculescu, Robert Feldt, Richard Torkar et al.

Search-Based Software Testing (SBST) is the application of optimization algorithms to problems in software testing. In previous work, we have implemented and evaluated Interactive Search-Based Software Testing (ISBST) tool prototypes, with a goal to successfully transfer the technique to industry. While SBSE solutions are often validated on benchmark problems, there is a need to validate them in an operational setting. The present paper discusses the development and deployment of SBST tools for use in industry and reflects on the transfer of these techniques to industry. In addition to previous work discussing the development and validation of an ISBST prototype, a new version of the prototype ISBST system was evaluated in the laboratory and in industry. This evaluation is based on an industrial System under Test (SUT) and was carried out with industrial practitioners. The Technology Transfer Model is used as a framework to describe the progression of the development and evaluation of the ISBST system. The paper presents a synthesis of previous work developing and evaluating the ISBST prototype, as well as presenting an evaluation, in both academia and industry, of that prototype's latest version. This paper presents an overview of the development and deployment of the ISBST system in an industrial setting, using the framework of the Technology Transfer Model. We conclude that the ISBST system is capable of evolving useful test cases for that setting, though improvements in the means the system uses to communicate that information to the user are still required. In addition, a set of lessons learned from the project are listed and discussed. Our objective is to help other researchers that wish to validate search-based systems in industry and provide more information about the benefits and drawbacks of these systems.

SEFeb 20, 2018

A Testability Analysis Framework for Non-Functional Properties

Michael Felderer, Bogdan Marculescu, Francisco Gomes de Oliveira Neto et al.

This paper presents background, the basic steps and an example for a testability analysis framework for non-functional properties.

SEFeb 6, 2018

Ways of Applying Artificial Intelligence in Software Engineering

Robert Feldt, Francisco G. de Oliveira Neto, Richard Torkar

As Artificial Intelligence (AI) techniques have become more powerful and easier to use they are increasingly deployed as key components of modern software systems. While this enables new functionality and often allows better adaptation to user needs it also creates additional problems for software engineers and exposes companies to new risks. Some work has been done to better understand the interaction between Software Engineering and AI but we lack methods to classify ways of applying AI in software systems and to analyse and understand the risks this poses. Only by doing so can we devise tools and solutions to help mitigate them. This paper presents the AI in SE Application Levels (AI-SEAL) taxonomy that categorises applications according to their point of AI application, the type of AI technology used and the automation level allowed. We show the usefulness of this taxonomy by classifying 15 papers from previous editions of the RAISE workshop. Results show that the taxonomy allows classification of distinct AI applications and provides insights concerning the risks associated with them. We argue that this will be important for companies in deciding how to apply AI in their software applications and to create strategies for its use.

SEJun 3, 2017

Evolution of statistical analysis in empirical software engineering research: Current state and steps forward

Francisco Gomes de Oliveira Neto, Richard Torkar, Robert Feldt et al.

Software engineering research is evolving and papers are increasingly based on empirical data from a multitude of sources, using statistical tests to determine if and to what degree empirical evidence supports their hypotheses. To investigate the practices and trends of statistical analysis in empirical software engineering (ESE), this paper presents a review of a large pool of papers from top-ranked software engineering journals. First, we manually reviewed 161 papers and in the second phase of our method, we conducted a more extensive semi-automatic classification of papers spanning the years 2001--2015 and 5,196 papers. Results from both review steps was used to: i) identify and analyze the predominant practices in ESE (e.g., using t-test or ANOVA), as well as relevant trends in usage of specific statistical methods (e.g., nonparametric tests and effect size measures) and, ii) develop a conceptual model for a statistical analysis workflow with suggestions on how to apply different statistical methods as well as guidelines to avoid pitfalls. Lastly, we confirm existing claims that current ESE practices lack a standard to report practical significance of results. We illustrate how practical significance can be discussed in terms of both the statistical analysis and in the practitioner's context.

SEDec 15, 2015

Tester Interactivity makes a Difference in Search-Based Software Testing: A Controlled Experiment

Bogdan Marculescu, Simon Poulding, Robert Feldt et al.

Context: Search-based software testing promises to provide users with the ability to generate high-quality test cases, and hence increase product quality, with a minimal increase in the time and effort required. One result that emerged out of a previous study to investigate the application of search-based software testing (SBST) in an industrial setting was the development of the Interactive Search-Based Software Testing (ISBST) system. ISBST allows users to interact with the underlying SBST system, guiding the search and assessing the results. An industrial evaluation indicated that the ISBST system could find test cases that are not created by testers employing manual techniques. The validity of the evaluation was threatened, however, by the low number of participants. Objective: This paper presents a follow-up study, to provide a more rigorous evaluation of the ISBST system. Method: To assess the ISBST system a two-way crossover controlled experiment was conducted with 58 students taking a Verification and Validation course. The NASA Task Load Index (NASA-TLX) is used to assess the workload experienced by the participants in the experiment. Results: The experimental results validated the hypothesis that the ISBST system generates test cases that are not found by the same participants employing manual testing techniques. A follow-up laboratory experiment also investigates the importance of interaction in obtaining the results. In addition to this main result, the subjective workload was assessed for each participant by means of the NASA-TLX tool. The evaluation showed that, while the ISBST system required more effort from the participants, they achieved the same performance. Conclusions: The paper provides evidence that the ISBST system develops test cases that are not found by manual techniques, and that interaction plays an important role in achieving that result.