Burak Turhan

SE
h-index39
22papers
751citations
Novelty26%
AI Score23

22 Papers

CYJul 14, 2023
Ethics in the Age of AI: An Analysis of AI Practitioners' Awareness and Challenges

Aastha Pant, Rashina Hoda, Simone V. Spiegler et al.

Ethics in AI has become a debated topic of public and expert discourse in recent years. But what do people who build AI - AI practitioners - have to say about their understanding of AI ethics and the challenges associated with incorporating it in the AI-based systems they develop? Understanding AI practitioners' views on AI ethics is important as they are the ones closest to the AI systems and can bring about changes and improvements. We conducted a survey aimed at understanding AI practitioners' awareness of AI ethics and their challenges in incorporating ethics. Based on 100 AI practitioners' responses, our findings indicate that majority of AI practitioners had a reasonable familiarity with the concept of AI ethics, primarily due to workplace rules and policies. Privacy protection and security was the ethical principle that majority of them were aware of. Formal education/training was considered somewhat helpful in preparing practitioners to incorporate AI ethics. The challenges that AI practitioners faced in the development of ethical AI-based systems included (i) general challenges, (ii) technology-related challenges and (iii) human-related challenges. We also identified areas needing further investigation and provided recommendations to assist AI practitioners and companies in incorporating ethics into AI development.

SEFeb 26, 2024Code
Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects

Han Wang, Sijia Yu, Chunyang Chen et al.

Deep Learning (DL) models have rapidly advanced, focusing on achieving high performance through testing model accuracy and robustness. However, it is unclear whether DL projects, as software systems, are tested thoroughly or functionally correct when there is a need to treat and test them like other software systems. Therefore, we empirically study the unit tests in open-source DL projects, analyzing 9,129 projects from GitHub. We find that: 1) unit tested DL projects have positive correlation with the open-source project metrics and have a higher acceptance rate of pull requests, 2) 68% of the sampled DL projects are not unit tested at all, 3) the layer and utilities (utils) of DL models have the most unit tests. Based on these findings and previous research outcomes, we built a mapping taxonomy between unit tests and faults in DL projects. We discuss the implications of our findings for developers and researchers and highlight the need for unit testing in open-source DL projects to ensure their reliability and stability. The study contributes to this community by raising awareness of the importance of unit testing in DL projects and encouraging further research in this area.

CYMar 21, 2024
Navigating Fairness: Practitioners' Understanding, Challenges, and Strategies in AI/ML Development

Aastha Pant, Rashina Hoda, Chakkrit Tantithamthavorn et al.

The rise in the use of AI/ML applications across industries has sparked more discussions about the fairness of AI/ML in recent times. While prior research on the fairness of AI/ML exists, there is a lack of empirical studies focused on understanding the perspectives and experiences of AI practitioners in developing a fair AI/ML system. Understanding AI practitioners' perspectives and experiences on the fairness of AI/ML systems are important because they are directly involved in its development and deployment and their insights can offer valuable real-world perspectives on the challenges associated with ensuring fairness in AI/ML systems. We conducted semi-structured interviews with 22 AI practitioners to investigate their understanding of what a 'fair AI/ML' is, the challenges they face in developing a fair AI/ML system, the consequences of developing an unfair AI/ML system, and the strategies they employ to ensure AI/ML system fairness. We developed a framework showcasing the relationship between AI practitioners' understanding of 'fair AI/ML' system and (i) their challenges in its development, (ii) the consequences of developing an unfair AI/ML system, and (iii) strategies used to ensure AI/ML system fairness. By exploring AI practitioners' perspectives and experiences, this study provides actionable insights to enhance AI/ML fairness, which may promote fairer systems, reduce bias, and foster public trust in AI technologies. Additionally, we also identify areas for further investigation and offer recommendations to aid AI practitioners and AI companies in navigating fairness.

SEOct 18, 2021
Use and Misuse of the Term Experiment in Mining Software Repositories Research

Claudia Ayala, Burak Turhan, Xavier Franch et al.

The significant momentum and importance of Mining Software Repositories (MSR) in Software Engineering (SE) has fostered new opportunities and challenges for extensive empirical research. However, MSR researchers seem to struggle to characterize the empirical methods they use into the existing empirical SE body of knowledge. This is especially the case of MSR experiments. To provide evidence on the special characteristics of MSR experiments and their differences with experiments traditionally acknowledged in SE so far, we elicited the hallmarks that differentiate an experiment from other types of empirical studies and characterized the hallmarks and types of experiments in MSR. We analyzed MSR literature obtained from a small-scale systematic mapping study to assess the use of the term experiment in MSR. We found that 19% of the papers claiming to be an experiment are indeed not an experiment at all but also observational studies, so they use the term in a misleading way. From the remaining 81% of the papers, only one of them refers to a genuine controlled experiment while the others stand for experiments with limited control. MSR researchers tend to overlook such limitations, compromising the interpretation of the results of their studies. We provide recommendations and insights to support the improvement of MSR experiments.

SEOct 6, 2021
How good does a Defect Predictor need to be to guide Search-Based Software Testing?

Anjana Perera, Burak Turhan, Aldeida Aleti et al.

Defect predictors, static bug detectors and humans inspecting the code can locate the parts of the program that are buggy before they are discovered through testing. Automated test generators such as search-based software testing (SBST) techniques can use this information to direct their search for test cases to likely buggy code, thus speeding up the process of detecting existing bugs. However, often the predictions given by these tools or humans are imprecise, which can misguide the SBST technique and may deteriorate its performance. In this paper, we study the impact of imprecision in defect prediction on the bug detection effectiveness of SBST. Our study finds that the recall of the defect predictor, i.e., the probability of correctly identifying buggy code, has a significant impact on bug detection effectiveness of SBST with a large effect size. On the other hand, the effect of precision, a measure for false alarms, is not of meaningful practical significance as indicated by a very small effect size. In particular, the SBST technique finds 7.5 less bugs on average (out of 420 bugs) for every 5% decrements of the recall. In the context of combining defect prediction and SBST, our recommendation for practice is to increase the recall of defect predictors at the expense of precision, while maintaining a precision of at least 75%. To account for the imprecision of defect predictors, in particular low recall values, SBST techniques should be designed to search for test cases that also cover the predicted non-buggy parts of the program, while prioritising the parts that have been predicted as buggy.

SEOct 5, 2021
Does Domain Change the Opinion of Individuals on Human Values? A Preliminary Investigation on eHealth Apps End-users

Humphrey Obie, Mojtaba Shahin, John Grundy et al.

The elicitation of end-users' human values - such as freedom, honesty, transparency, etc. - is important in the development of software systems. We carried out two preliminary Q-studies to understand (a) the general human value opinion types of eHealth applications (apps) end-users (b) the eHealth domain human value opinion types of eHealth apps end-users (c) whether there are differences between the general and eHealth domain opinion types. Our early results show three value opinion types using generic value instruments: (1) fun-loving, success-driven and independent end-user, (2) security-conscious, socially-concerned, and success-driven end-user, and (3) benevolent, success-driven, and conformist end-user Our results also show two value opinion types using domain-specific value instruments: (1) security-conscious, reputable, and honest end-user, and (2) success-driven, reputable and pain-avoiding end-user. Given these results, consideration should be given to domain context in the design and application of values elicitation instruments.

SESep 26, 2021
Defect Prediction Guided Search-Based Software Testing

Anjana Perera, Aldeida Aleti, Marcel Böhme et al.

Today, most automated test generators, such as search-based software testing (SBST) techniques focus on achieving high code coverage. However, high code coverage is not sufficient to maximise the number of bugs found, especially when given a limited testing budget. In this paper, we propose an automated test generation technique that is also guided by the estimated degree of defectiveness of the source code. Parts of the code that are likely to be more defective receive more testing budget than the less defective parts. To measure the degree of defectiveness, we leverage Schwa, a notable defect prediction technique. We implement our approach into EvoSuite, a state of the art SBST tool for Java. Our experiments on the Defects4J benchmark demonstrate the improved efficiency of defect prediction guided test generation and confirm our hypothesis that spending more time budget on likely defective parts increases the number of bugs found in the same time budget.

LGApr 2, 2021
A Comparison of Similarity Based Instance Selection Methods for Cross Project Defect Prediction

Seyedrebvar Hosseini, Burak Turhan

Context: Previous studies have shown that training data instance selection based on nearest neighborhood (NN) information can lead to better performance in cross project defect prediction (CPDP) by reducing heterogeneity in training datasets. However, neighborhood calculation is computationally expensive and approximate methods such as Locality Sensitive Hashing (LSH) can be as effective as exact methods. Aim: We aim at comparing instance selection methods for CPDP, namely LSH, NN-filter, and Genetic Instance Selection (GIS). Method: We conduct experiments with five base learners, optimizing their hyper parameters, on 13 datasets from PROMISE repository in order to compare the performance of LSH with benchmark instance selection methods NN-Filter and GIS. Results: The statistical tests show six distinct groups for F-measure performance. The top two group contains only LSH and GIS benchmarks whereas the bottom two groups contain only NN-Filter variants. LSH and GIS favor recall more than precision. In fact, for precision performance only three significantly distinct groups are detected by the tests where the top group is comprised of NN-Filter variants only. Recall wise, 16 different groups are identified where the top three groups contain only LSH methods, four of the next six are GIS only and the bottom five contain only NN-Filter. Finally, NN-Filter benchmarks never outperform the LSH counterparts with the same base learner, tuned or non-tuned. Further, they never even belong to the same rank group, meaning that LSH is always significantly better than NN-Filter with the same learner and settings. Conclusions: The increase in performance and the decrease in computational overhead and runtime make LSH a promising approach. However, the performance of LSH is based on high recall and in environments where precision is considered more important NN-Filter should be considered.

SEDec 18, 2020
A First Look at Human Values-Violation in App Reviews

Humphrey O. Obie, Waqar Hussain, Xin Xia et al.

Ubiquitous technologies such as mobile software applications (mobile apps) have a tremendous influence on the evolution of the social, cultural, economic, and political facets of life in society. Mobile apps fulfil many practical purposes for users including entertainment, transportation, financial management, etc. Given the ubiquity of mobile apps in the lives of individuals and the consequent effect of these technologies on society, it is essential to consider the relationship between human values and the development and deployment of mobile apps. The many negative consequences of violating human values such as privacy, fairness or social justice by technology have been documented in recent times. If we can detect these violations in a timely manner, developers can look to better address them. To understand the violation of human values in a range of common mobile apps, we analysed 22,119 app reviews from Google Play Store using natural language processing techniques. We base our values violation detection approach on a widely accepted model of human values; the Schwartz theory of basic human values. The results of our analysis show that 26.5% of the reviews contained text indicating user perceived violations of human values. We found that benevolence and self-direction were the most violated value categories, and conformity and tradition were the least violated categories. Our results also highlight the need for a proactive approach to the alignment of values amongst stakeholders and the use of app reviews as a valuable additional source for mining values requirements.

SENov 24, 2020
A Family of Experiments on Test-Driven Development

Adrian Santos, Sira Vegas, Oscar Dieste et al.

Context: Test-driven development (TDD) is an agile software development approach that has been widely claimed to improve software quality. However, the extent to which TDD improves quality appears to be largely dependent upon the characteristics of the study in which it is evaluated (e.g., the research method, participant type, programming environment, etc.). The particularities of each study make the aggregation of results untenable. Objectives: The goal of this paper is to: increase the accuracy and generalizability of the results achieved in isolated experiments on TDD, provide joint conclusions on the performance of TDD across different industrial and academic settings, and assess the extent to which the characteristics of the experiments affect the quality-related performance of TDD. Method: We conduct a family of 12 experiments on TDD in academia and industry. We aggregate their results by means of meta-analysis. We perform exploratory analyses to identify variables impacting the quality-related performance of TDD. Results: TDD novices achieve a slightly higher code quality with iterative test-last development (i.e., ITL, the reverse approach of TDD) than with TDD. The task being developed largely determines quality. The programming environment, the order in which TDD and ITL are applied, or the learning effects from one development approach to another do not appear to affect quality. The quality-related performance of professionals using TDD drops more than for students. We hypothesize that this may be due to their being more resistant to change and potentially less motivated than students. Conclusion: Previous studies seem to provide conflicting results on TDD performance (i.e., positive vs. negative, respectively). We hypothesize that these conflicting results may be due to different study durations, experiment participants being unfamiliar with the TDD process...

SENov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Steffen Herbold, Alexander Trautsch, Benjamin Ledel et al.

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

SEOct 7, 2020
Empirical Standards for Software Engineering Research

Paul Ralph, Nauman bin Ali, Sebastian Baltes et al.

Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

SEAug 28, 2020
Researcher Bias in Software Engineering Experiments: a Qualitative Investigation

Simone Romano, Davide Fucci, Giuseppe Scanniello et al.

Researcher Bias (RB) occurs when researchers influence the results of an empirical study based on their expectations.RB might be due to the use of Questionable Research Practices(QRPs). In research fields like medicine, blinding techniques have been applied to counteract RB. We conducted an explorative qualitative survey to investigate RB in Software Engineering (SE)experiments, with respect to (i) QRPs potentially leading to RB, (ii) causes behind RB, and (iii) possible actions to counteract including blinding techniques. Data collection was based on semi-structured interviews. We interviewed nine active experts in the empirical SE community. We then analyzed the transcripts of these interviews through thematic analysis. We found that some QRPs are acceptable in certain cases. Also, it appears that the presence of RB is perceived in SE and, to counteract RB, a number of solutions have been highlighted: some are intended for SE researchers and others for the boards of SE research outlets.

SEMay 3, 2020
Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Paul Ralph, Sebastian Baltes, Gianisa Adisaputri et al.

Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.

SEApr 11, 2020
Increasing Validity Through Replication: An Illustrative TDD Case

Adrian Santos, Sira Vegas, Fernando Uyaguari et al.

Context: Software Engineering (SE) experiments suffer from threats to validity that may impact their results. Replication allows researchers building on top of previous experiments' weaknesses and increasing the reliability of the findings. Objective: Illustrating the benefits of replication to increase the reliability of the findings and uncover moderator variables. Method: We replicate an experiment on Test-Driven-Development (TDD) and address some of its threats to validity and those of a previous replication. We compare the replications' results and hypothesize on plausible moderators impacting results. Results: Differences across TDD replications' results might be due to the operationalization of the response variables, the allocation of subjects to treatments, the allowance to work outside the laboratory, the provision of stubs, or the task. Conclusion: Replications allow examining the robustness of the findings, hypothesizing on plausible moderators influencing results, and strengthening the evidence obtained.

SESep 11, 2019
Iterative versus Exhaustive Data Selection for Cross Project Defect Prediction: An Extended Replication Study

Seyedrebvar Hosseini, Burak Turhan

Context: The effectiveness of data selection approaches in improving the performance of cross project defect prediction(CPDP) has been shown in multiple previous studies. Beside that, replication studies play an important role in the support of any valid study. Repeating a study using the same or different subjects can lead to better understandings of the nature of the problem. Objective: We use an iterative dataset selection (IDS) approach to generate training datasets and evaluate them on a set of randomly created validation datasets in the context of CPDP while considering a higher range of flexibility which makes the approach more feasible in practice. Method: We replicate an earlier study and present some insights into the achieved results while pointing out some of the shortcomings of the original study. Using the lessons learned, we propose to use an alternative training/validation dataset generation approaches which not only is more feasible in practice, but also achieves slightly better performances. We compare the results of our experiments to those from scenarios A, B, C and D from the original study. Results:Our experiments reveal that IDS is heavily recall based. The average recall performance for all test sets is 0.933 which is significantly higher than that from the replicated method. This in turn comes with a loss in precision. IDS has the lowest precision among the compared scenarios that use Decision Table learner. IDS however, achieves comparable or better F-measure performances. IDS achieves higher mean, median and min F-measure values while being more stable generally, in comparison with the replicated method. Conclusions: We conclude that datasets obtained from iterative/search-based approaches is a promising way to tackle CPDP. Especially, the performance increase in terms of both time and performance encourages further investigation of our approach.

SEFeb 28, 2019
Requirements Framing Affects Design Creativity

Rahul Mohanani, Burak Turhan, Paul Ralph

Design creativity, the originality and practicality of a solution concept is critical for the success of many software projects. However, little research has investigated the relationship between the way desiderata are presented and design creativity. This study therefore investigates the impact of presenting desiderata as ideas, requirements or prioritized requirements on design creativity. Two between-subjects randomized controlled experiments were conducted with 42 and 34 participants. Participants were asked to create design concepts from a list of desiderata. Participants who received desiderata framed as requirements or prioritized requirements created designs that are, on average, less original but more practical than the designs created by participants who received desiderata framed as ideas. This suggests that more formal, structured presentations of desiderata are less appropriate where a creative solution is desired. The results also show that design performance is highly susceptible to minor changes in the vernacular used to communicate desiderata.

SEOct 30, 2018
Key Stakeholders' Value Propositions for Feature Selection in Software-intensive Products: An Industrial Case Study

Pilar Rodríguez, Emilia Mendes, Burak Turhan

Numerous software companies are adopting value-based decision making. However, what does value mean for key stakeholders making decisions? How do different stakeholder groups understand value? Without an explicit understanding of what value means, decisions are subject to ambiguity and vagueness, which are likely to bias them. This case study provides an in-depth analysis of key stakeholders' value propositions when selecting features for a large telecommunications company's software-intensive product. Stakeholders' value propositions were elicited via interviews, which were analyzed using Grounded Theory coding techniques (open and selective coding). Thirty-six value propositions were identified and classified into six dimensions: customer value, market competitiveness, economic value/profitability, cost efficiency, technology & architecture, and company strategy. Our results show that although propositions in the customer value dimension were those mentioned the most, the concept of value for feature selection encompasses a wide range of value propositions. Moreover, stakeholder groups focused on different and complementary value dimensions, calling to the importance of involving all key stakeholders in the decision making process. Although our results are particularly relevant to companies similar to the one described herein, they aim to generate a learning process on value-based feature selection for practitioners and researchers in general.

SESep 5, 2018
On the Need of Preserving Order of Data When Validating Within-Project Defect Classifiers

Davide Falessi, Jacky Huang, Likhita Narayana et al.

[Context] The use of defect prediction models, such as classifiers, can support testing resource allocations by using data of the previous releases of the same project for predicting which software components are likely to be defective. A validation technique, hereinafter technique defines a specific way to split available data in training and test sets to measure a classifier accuracy. Time-series techniques have the unique ability to preserve the temporal order of data; i.e., preventing the testing set to have data antecedent to the training set. [Aim] The aim of this paper is twofold: first we check if there is a difference in the classifiers accuracy measured by time-series versus non-time-series techniques. Afterward, we check for a possible reason for this difference, i.e., if defect rates change across releases of a project. [Method] Our method consists of measuring the accuracy, i.e., AUC, of 10 classifiers on 13 open and two closed projects by using three validation techniques, namely cross validation, bootstrap, and walk-forward, where only the latter is a time-series technique. [Results] We find that the AUC of the same classifier used on the same project and measured by 10-fold varies compared to when measured by walk-forward in the range [-0.20, 0.22], and it is statistically different in 45% of the cases. Similarly, the AUC measured by bootstrap varies compared to when measured by walk-forward in the range [-0.17, 0.43], and it is statistically different in 56% of the cases. [Conclusions] We recommend choosing the technique to be used by carefully considering the conclusions to draw, the property of the available datasets, and the level of realism with the classifier usage scenario.

SEJul 11, 2018
The Effect of Noise on Sofware Engineers' Performance

Simone Romano, Giuseppe Scanniello, Davide Fucci et al.

Background: Noise, defined as an unwanted sound, is one of the commonest factors that could affect people's performance in their daily work activities. The software engineering research community has marginally investigated the effects of noise on software engineers' performance. Aims: We studied if noise affects software engineers' performance in (i) comprehending functional requirements and (ii) fixing faults in the source code. Method: We conducted two experiments with final-year undergraduate students in Computer Science. In the first experiment, we asked 55 students to comprehend functional requirements exposing them or not to noise, while in the second experiment 42 students were asked to fix faults in Java code. Results: The participants in the second experiment, when exposed to noise, had significantly worse performance in fixing faults in the source code. On the other hand, we did not observe any statistically significant difference in the first experiment. Conclusions: Fixing faults in source code seems to be more vulnerable to noise than comprehending functional requirements.

SEJul 12, 2017
Cognitive Biases in Software Engineering: A Systematic Mapping Study

Rahul Mohanani, Iflaah Salman, Burak Turhan et al.

One source of software project challenges and failures is the systematic errors introduced by human cognitive biases. Although extensively explored in cognitive psychology, investigations concerning cognitive biases have only recently gained popularity in software engineering (SE) research. This paper therefore systematically maps, aggregates and synthesizes the literature on cognitive biases in software engineering to generate a comprehensive body of knowledge, understand state of the art research and provide guidelines for future research and practise. Focusing on bias antecedents, effects and mitigation techniques, we identified 65 articles, which investigate 37 cognitive biases, published between 1990 and 2016. Despite strong and increasing interest, the results reveal a scarcity of research on mitigation techniques and poor theoretical foundations in understanding and interpreting cognitive biases. Although bias-related research has generated many new insights in the software engineering community, specific bias mitigation techniques are still needed for software professionals to overcome the deleterious effects of cognitive biases on their work.

SENov 18, 2016
A Dissection of the Test-Driven Development Process: Does It Really Matter to Test-First or to Test-Last?

Davide Fucci, Hakan Erdogmus, Burak Turhan et al.

Background: Test-driven development (TDD) is a technique that repeats short coding cycles interleaved with testing. The developer first writes a unit test for the desired functionality, followed by the necessary production code, and refactors the code. Many empirical studies neglect unique process characteristics related to TDD iterative nature. Aim: We formulate four process characteristic: sequencing, granularity, uniformity, and refactoring effort. We investigate how these characteristics impact quality and productivity in TDD and related variations. Method: We analyzed 82 data points collected from 39 professionals, each capturing the process used while performing a specific development task. We built regression models to assess the impact of process characteristics on quality and productivity. Quality was measured by functional correctness. Result: Quality and productivity improvements were primarily positively associated with the granularity and uniformity. Sequencing, the order in which test and production code are written, had no important influence. Refactoring effort was negatively associated with both outcomes. We explain the unexpected negative correlation with quality by possible prevalence of mixed refactoring. Conclusion: The claimed benefits of TDD may not be due to its distinctive test-first dynamic, but rather due to the fact that TDD-like processes encourage fine-grained, steady steps that improve focus and flow.