23.0 SE Jun 4
Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering
Vahid Garousi
AI is changing how software engineers work, but it often comes with hidden burdens and costs. In this paper, we characterize two such often-overlooked burdens: (1) the constant need for human oversight and inspection of AI-generated artifacts; and (2) the growing cognitive overload on software engineers from receiving large amounts of suggestions from AI tools. The need for human oversight is not optional: engineers must review, validate, and sometimes rework what AI produces. At the same time, the flood of AI suggestions, prompts, and possible solutions can leave developers mentally stretched. By synthesizing evidence from recent practitioner opinions, we highlight these challenges and open a conversation about how teams can handle them in day-to-day AI-assisted software engineering.
19.8 SE Mar 20
How Software Engineers Engage with AI: A Pragmatic Workflow
Vahid Garousi, Zafar Jafarov, Aytan Mövsümova et al.
Artificial Intelligence (AI) tools such as GitHub Copilot and ChatGPT are increasingly used in software engineering (SE) for tasks such as code, test, and documentation generation. However, engineers often face uncertainty about when to trust, refine, or discard AI-generated artifacts. We present a pragmatic workflow, complemented by a four-quadrant decision model, that formalizes how developers iteratively prompt, inspect, refine, and, when needed, fall back to manual work. The workflow and decision model were derived from a grey literature review and field observations across three industrial settings in Türkiye and Azerbaijan. Two real-world scenarios demonstrate the workflow's practical value, showing how engineers navigate key decision points when using AI. Our approach offers lightweight, structured guidance to support more deliberate and quality-aware use of AI tools in everyday SE tasks.
SE Apr 5, 2021
Model-based testing in practice: An experience report from the web applications domain
Vahid Garousi, Alper Buğra Keleş, Yunus Balaman et al.
In the context of a large software testing company, we have deployed the model-based testing (MBT) approach to take the company's test automation practices to higher levels of maturity and capability. We have chosen, from a set of open-source/commercial MBT tools, an open-source tool named GraphWalker, and have pragmatically used MBT for end-to-end test automation of several large web and mobile applications under test. The MBT approach has provided, so far in our project, various tangible and intangible benefits in terms of improved test coverage (number of paths tested), improved test-design practices, and also improved real-fault detection effectiveness. The goal of this experience report (applied research report), done based on "action research", is to share our experience of applying and evaluating MBT as a software technology (technique and tool) in a real industrial setting. We aim to contribute to the body of empirical evidence on the industrial application of MBT by sharing our industry-academia project on applying MBT in practice, the insights that we have gained, and the challenges and questions that we have faced and tackled so far. We discuss an overview of the industrial setting, provide motivation, explain the events leading to the outcomes, discuss the challenges faced, summarize the outcomes, and conclude with lessons learned, take-away messages, and practical advice based on the described experience. By learning from the best practices in this paper, other test engineers could conduct more mature MBT in their test projects.
SE Apr 26, 2025
Why you shouldn't fully trust ChatGPT: A synthesis of this AI tool's error rates across disciplines and the software engineering lifecycle
Vahid Garousi
Context: ChatGPT and other large language models (LLMs) are widely used across healthcare, business, economics, engineering, and software engineering (SE). Despite their popularity, concerns persist about their reliability, especially their error rates across domains and the software development lifecycle (SDLC). Objective: This study synthesizes and quantifies ChatGPT's reported error rates across major domains and SE tasks aligned with SDLC phases. It provides an evidence-based view of where ChatGPT excels, where it fails, and how reliability varies by task, domain, and model version (GPT-3.5, GPT-4, GPT-4-turbo, GPT-4o). Method: A Multivocal Literature Review (MLR) was conducted, gathering data from academic studies, reports, benchmarks, and grey literature up to 2025. Factual, reasoning, coding, and interpretive errors were considered. Data were grouped by domain and SE phase and visualized using boxplots to show error distributions. Results: Error rates vary across domains and versions. In healthcare, rates ranged from 8% to 83%. Business and economics saw error rates drop from ~50% with GPT-3.5 to 15-20% with GPT-4. Engineering tasks averaged 20-30%. Programming success reached 87.5%, though complex debugging still showed over 50% errors. In SE, requirements and design phases showed lower error rates (~5-20%), while coding, testing, and maintenance phases had higher variability (10-50%). Upgrades from GPT-3.5 to GPT-4 improved reliability. Conclusion: Despite improvements, ChatGPT still exhibits non-negligible error rates varying by domain, task, and SDLC phase. Full reliance without human oversight remains risky, especially in critical settings. Continuous evaluation and critical validation are essential to ensure reliability and trustworthiness.
SE Dec 25, 2020
Mining user reviews of COVID contact-tracing apps: An exploratory analysis of nine European apps
Vahid Garousi, David Cutting, Michael Felderer
Context: More than 50 countries have developed COVID contact-tracing apps to limit the spread of coronavirus. However, many experts and scientists cast doubt on the effectiveness of those apps. For each app, a large number of reviews have been entered by end-users in app stores. Objective: Our goal is to gain insights into the user reviews of those apps, and to find out the main problems that users have reported. Our focus is to assess the "software in society" aspects of the apps, based on user reviews. Method: We selected nine European national apps for our analysis and used a commercial app-review analytics tool to extract and mine the user reviews. For all the apps combined, our dataset includes 39,425 user reviews. Results: Results show that users are generally dissatisfied with the nine apps under study, except the Scottish ("Protect Scotland") app. Some of the major issues that users have complained about are high battery drainage and doubts on whether apps are really working. Conclusion: Our results show that more work is needed by the stakeholders behind the apps (e.g., app developers, decision-makers, public health experts) to improve the public adoption, software quality and public perception of these apps.
SE Sep 30, 2020
Retrieving and mining professional experience of software practice from grey literature: an exploratory review
Austen Rainer, Ashley Williams, Vahid Garousi et al.
Background: Retrieving and mining practitioners' self-reports of their professional experience of software practice could provide valuable evidence for research. We are, however, unaware of any existing reviews of research conducted in this area. Objective: To review and classify previous research, and to identify insights into the challenges research confronts when retrieving and mining practitioners' self-reports of their experience of software practice. Method: We conduct an exploratory review to identify and classify 42 articles. We analyse a selection of those articles for insights on challenges to mining professional experience. Results: We identify only one directly relevant article. Even then this article concerns the software professional's emotional experiences rather than the professional's reporting of behaviour and events occurring during software practice. We discuss challenges concerning: the prevalence of professional experience; definitions, models and theories; the sparseness of data; units of discourse analysis; annotator agreement; evaluation of the performance of algorithms; and the lack of replications. Conclusion: No directly relevant prior research appears to have been conducted in this area. We discuss the value of reporting negative results in secondary studies. There is a range of research opportunities but also considerable challenges. We formulate a set of guiding questions for further research in this area.
SE May 26, 2020
Assessing the maturity of software testing services using CMMI-SVC: An industrial case study
Vahid Garousi, Seyfettin Arkan, Gökhan Urul et al.
Context: While many companies conduct their software testing activities in-house, many other companies outsource their software testing needs to other firms who act as software testing service providers. As a result, Testing as a Service (TaaS) has emerged as a strong service industry in the last several decades. In the context of software testing services, there could be various challenges (e.g., during the planning and service delivery phases) and, as a result, the quality of testing services is not always as expected. Objective: It is important, for both providers and customers of testing services, to assess the quality and maturity of test services and subsequently improve them. Method: Motivated by a real industrial need in the context of several testing service providers, to assess the maturity of their software testing services, we chose the existing CMMI for Services maturity model (CMMI-SVC), and conducted a case study using it in the context of two Turkish testing service providers. Results: The case-study results show that maturity appraisal of testing services using CMMI-SVC was helpful for both companies and their test management teams by enabling them to objectively assess the maturity of their testing services and also by pinpointing potential improvement areas. Conclusion: We empirically observed that, after some minor customization, CMMI-SVC is indeed a suitable model for maturity appraisal of testing services.
SE May 19, 2020
Visual GUI testing in practice: An extended industrial case study
Vahid Garousi, Wasif Afzal, Adem Çağlar et al.
Context: Visual GUI testing (VGT) is referred to as the latest generation of GUI-based testing. It is a tool-driven technique, which uses image recognition for interacting with and asserting the behavior of the system under test (SUT). Motivated by the industrial need of a large Turkish software and systems company providing solutions in the defense and IT sectors, an action-research project was recently initiated to implement VGT in several teams and projects in the company. Objective: To address the above needs, we planned and carried out an empirical investigation with the goal of assessing VGT using two tools (Sikuli and JAutomate). The purpose was to determine a suitable approach and tool for VGT of a given project (software product) in the company, and to increase the know-how in the company's test teams. Method: Using an action-research case-study design, we investigated the use of VGT in the studied organization. Specifically, using the two selected VGT tools, we conducted a quantitative and a qualitative evaluation of VGT. Results: By assessing the list of Challenges, Problems and Limitations (CPL), proposed in previous work, in the context of our empirical study, we found that test-tool- and SUT-related CPLs were quite comparable to a previous empirical study, e.g., the synchronization between SUT and test tools was not always robust and there were failures in test tools' image recognition features. When assessing the types of test maintenance activities, when executing the automated test cases on next versions of the SUTs, for the case of the two test tools, we found that about half of the test cases (59.1% and 47.8%) failed in the next version. Conclusion: By our results, we confirm some of the previously-reported issues when conducting VGT. Further, we highlight some additional challenges in test maintenance when using VGT.
SE Mar 8, 2020
Software-testing education: A systematic literature mapping
Vahid Garousi, Austen Rainer, Per Lauvås et al.
Context: With the rising complexity and scale of software systems, there is an ever-increasing demand for sophisticated and cost-effective software testing. To meet such a demand, there is a need for a highly-skilled software testing work-force (test engineers) in the industry. To address that need, many university educators worldwide have included software-testing education in their software engineering (SE) or computer science (CS) programs. Objective: Our objective in this paper is to summarize the body of experience and knowledge in the area of software-testing education to benefit the readers (both educators and researchers) in designing and delivering software testing courses in university settings, and in conducting further education research in this area. Method: To address the above need, we conducted a systematic literature mapping (SLM) to synthesize what the community of educators has published on this topic. After compiling a candidate pool of 307 papers, and applying a set of inclusion/exclusion criteria, our final pool included 204 papers published between 1992 and 2019. Results: The topic of software-testing education is becoming more active, as we can see by the increasing number of papers. Many pedagogical approaches (how to best teach testing), course-ware, and specific tools for testing education have been proposed. Many challenges in testing education, and insights on how to overcome those challenges, have been reported. Conclusion: This paper provides educators and researchers with a classification of existing studies within software-testing education. We further synthesize challenges and insights reported when teaching software testing. The paper also provides a reference ("index") to the vast body of knowledge and experience on teaching software testing.
SE Mar 1, 2020
Experience in engineering of scientific software: The case of an optimization software for oil pipelines
Vahid Garousi, Ehsan Abbasi, Bedir Tekinerdogan
Development of scientific and engineering software is usually different from, and could be more challenging than, the development of conventional enterprise software. The authors were involved in a technology-transfer project between academia and industry which focused on engineering, development and testing of software for optimizing pumping energy costs for oil pipelines. Experts with different skillsets (mechanical, power and software engineers) were involved. Given the complex nature of the software (a sophisticated underlying optimization model) and having experts from different fields, there were challenges in various software engineering aspects of the software system (e.g., requirements and testing). We report our observations and experience in addressing those challenges during our technology-transfer project, and aim to add to the existing body of experience and evidence in engineering of scientific and engineering software. We believe that our observations, experience and lessons learnt could be useful for other researchers and practitioners in engineering of other scientific and engineering software systems.
SE Nov 27, 2019
Benefitting from the Grey Literature in Software Engineering Research
Vahid Garousi, Michael Felderer, Mika V. Mäntylä et al.
Researchers generally place the most trust in peer-reviewed, published information, such as journals and conference papers. By contrast, software engineering (SE) practitioners typically do not have the time, access or expertise to review and benefit from such publications. As a result, practitioners are more likely to turn to other sources of information that they trust, e.g., trade magazines, online blog-posts, survey results or technical reports, collectively referred to as Grey Literature (GL). Furthermore, practitioners also share their ideas and experiences as GL, which can serve as a valuable data source for research. While GL itself is not a new topic in SE, using, benefitting from, and synthesizing knowledge from the GL in SE is a contemporary topic in empirical SE research, and we are seeing that researchers are increasingly benefitting from the knowledge available within GL. The goal of this chapter is to provide an overview of GL in SE, together with insights on how SE researchers can effectively use and benefit from the knowledge and evidence available in the vast amount of GL.
DL Aug 12, 2019
Citations in Software Engineering -- Paper-related, Journal-related, and Author-related Factors
Mika Mäntylä, Vahid Garousi
Many factors could affect the number of citations to a paper. Citations have an important role in research policy and in measuring the excellence of research and researchers. This work is the first study in software engineering (SE) to assess multiple factors affecting the number of citations to SE papers. We use (a) negative binomial regression and (b) quantile regression to study arithmetic mean and median expected citations of a paper. Our dataset includes all the 25,113 papers which have been published in a set of 16 main SE journals, between 1970 and 2018. Our results indicate that publication venue, author team's past citations, paper length, the number of references, and the recency of references are the most influential factors on the number of citations to SE papers. From our empirical findings, we present several implications and advice to researchers for getting higher citations on their papers, which are in addition to the obvious case of conducting high-quality technical research, e.g. (1) Aim for high-profile venues, (2) Build a high-quality author team with highly cited past papers, and (3) Aim for high-quality work that has comprehensive content (thus longer paper length and reference list).
SE Mar 31, 2019
Video Game Development in a Rush: A Survey of the Global Game Jam Participants
Markus Borg, Vahid Garousi, Anas Mahmoud et al.
Video game development is a complex endeavor, often involving complex software, large organizations, and aggressive release deadlines. Several studies have reported that periods of "crunch time" are prevalent in the video game industry, but there are few studies on the effects of time pressure. We conducted a survey with participants of the Global Game Jam (GGJ), a 48-hour hackathon. Based on 198 responses, the results suggest that: (1) iterative brainstorming is the most popular method for conceptualizing initial requirements; (2) continuous integration, minimum viable product, scope management, version control, and stand-up meetings are frequently applied development practices; (3) regular communication, internal playtesting, and dynamic and proactive planning are the most common quality assurance activities; and (4) familiarity with agile development has a weak correlation with perception of success in GGJ. We conclude that GGJ teams rely on ad hoc approaches to development and face-to-face communication, and recommend some complementary practices with limited overhead. Furthermore, as our findings are similar to recommendations for software startups, we posit that game jams and the startup scene share contextual similarities. Finally, we discuss the drawbacks of systemic "crunch time" and argue that game jam organizers are in a good position to problematize the phenomenon.
SE Dec 5, 2018
Closing the gap between software engineering education and industrial needs
Vahid Garousi, Görkem Giray, Eray Tüzün et al.
According to different reports, many recent software engineering graduates face difficulties when beginning their professional careers, due to misalignment of the skills learnt in their university education with what is needed in industry. To address that need, many studies have been conducted to align software engineering education with industry needs. To synthesize that body of knowledge, we present in this paper a systematic literature review (SLR) which summarizes the findings of 33 studies in this area. By doing a meta-analysis of all those studies and using data from 12 countries and over 4,000 data points, this study will enable educators and hiring managers to adapt their education and hiring efforts to best prepare the software engineering workforce.
SE Dec 4, 2018
Practical relevance of software engineering research: Synthesizing the community's voice
Vahid Garousi, Markus Borg, Markku Oivo
Software engineering (SE) research should be relevant to industrial practice. There have been regular discussions in the SE community on this issue since the 1980s, led by pioneers such as Robert Glass. As we recently passed the milestone of "50 years of software engineering", some recent positive efforts have been made in this direction, e.g., establishing "industrial" tracks in several SE conferences. However, many researchers and practitioners believe that we, as a community, are still struggling with research relevance and utility. The goal of this paper is to synthesize the evidence and experience-based opinions shared on this topic so far in the SE community, and to encourage the community to further reflect and act on research relevance. For this purpose, we have conducted a Multi-vocal Literature Review (MLR) of 54 systematically-selected sources (papers and non-peer-reviewed articles). Instead of relying on and considering the individual opinions on research relevance, mentioned in each of the sources, the MLR aims to synthesize and provide the "holistic" view on the topic. The highlights of our MLR findings are as follows. The top three root causes of low relevance, discussed in the community, are: (1) Researchers having simplistic views (or wrong assumptions) about SE in practice; (2) Lack of connection with industry; and (3) Wrong identification of research problems. The top three suggestions for improving research relevance are: (1) Using appropriate research approaches such as action-research; (2) Choosing relevant research problems; and (3) Collaborating with industry. By synthesizing all the discussions on this important topic so far, this paper aims to encourage further discussions and actions in the community to increase our collective efforts to improve research relevance.
SE Jun 2, 2018
NLP-assisted software testing: A systematic mapping of the literature
Vahid Garousi, Sara Bauer, Michael Felderer
Context: To reduce the manual effort of extracting test cases from natural-language requirements, many approaches based on Natural Language Processing (NLP) have been proposed in the literature. Given the large number of approaches in this area, and since many practitioners are eager to utilize such techniques, it is important to synthesize and provide an overview of the state-of-the-art in this area. Objective: Our objective is to summarize the state-of-the-art in NLP-assisted software testing, which could help practitioners utilize those NLP-based techniques. Moreover, this can benefit researchers by providing an overview of the research landscape. Method: To address the above need, we conducted a survey in the form of a systematic literature mapping (classification). After compiling an initial pool of 95 papers, we conducted a systematic voting, and our final pool included 67 technical papers. Results: This review paper provides an overview of the contribution types presented in the papers, types of NLP approaches used to assist software testing, types of required input requirements, and a review of tool support in this area. Some key results we have detected are: (1) only four of the 38 tools (11%) presented in the papers are available for download; (2) a large proportion of the papers (30 of 67) provided only a shallow exposure to the NLP aspects (almost no details). Conclusion: This paper would benefit both practitioners and researchers by serving as an "index" to the body of knowledge in this area. The results could help practitioners utilize the existing NLP-based techniques; this in turn reduces the cost of test-case design and decreases the amount of human resources spent on test activities. After sharing this review with some of our industrial collaborators, initial insights show that this review can indeed be useful and beneficial to practitioners.
SE Jan 7, 2018
A survey on software testability
Vahid Garousi, Michael Felderer, Feyza Nur Kilicaslan
Context: Software testability is the degree to which a software system or a unit under test supports its own testing. To predict and improve software testability, a large number of techniques and metrics have been proposed by both practitioners and researchers in the last several decades. Reviewing and getting an overview of the entire state-of-the-art and state-of-the-practice in this area is often challenging for a practitioner or a new researcher. Objective: Our objective is to summarize the body of knowledge in this area and to benefit the readers (both practitioners and researchers) in preparing, measuring and improving software testability. Method: To address the above need, the authors conducted a survey in the form of a systematic literature mapping (classification) to find out what we as a community know about this topic. After compiling an initial pool of 303 papers, and applying a set of inclusion/exclusion criteria, our final pool included 208 papers. Results: The area of software testability has been comprehensively studied by researchers and practitioners. Approaches for measurement of testability and improvement of testability are the most-frequently addressed in the papers. The two most often mentioned factors affecting testability are observability and controllability. Common ways to improve testability are testability transformation, improving observability, adding assertions, and improving controllability. Conclusion: This paper serves for both researchers and practitioners as an "index" to the vast body of knowledge in the area of testability. The results could help practitioners measure and improve software testability in their projects.
SE Jul 9, 2017
Guidelines for including grey literature and conducting multivocal literature reviews in software engineering
Vahid Garousi, Michael Felderer, Mika V. Mäntylä
Context: A Multivocal Literature Review (MLR) is a form of a Systematic Literature Review (SLR) which includes the grey literature (e.g., blog posts and white papers) in addition to the published (formal) literature (e.g., journal and conference papers). MLRs are useful for both researchers and practitioners since they provide summaries of both the state of the art and the state of practice in a given area. Objective: There are several guidelines to conduct SLR studies in SE. However, several phases of MLRs differ from those of traditional SLRs, for instance with respect to the search process and source quality assessment. Therefore, SLR guidelines are only partially useful for conducting MLR studies. Our goal in this paper is to present guidelines on how to conduct MLR studies in SE. Method: To develop the MLR guidelines, we benefit from three inputs: (1) existing SLR guidelines in SE, (2) a literature survey of MLR guidelines and experience papers in other fields, and (3) our own experiences in conducting several MLRs in SE. All derived guidelines are discussed in the context of three example MLRs as running examples (two from SE and one MLR from the medical sciences). Results: The resulting guidelines cover all phases of conducting and reporting MLRs in SE, from the planning phase, through conducting the review, to the final reporting of the review. In particular, we believe that incorporating and adopting a vast set of recommendations from MLR guidelines and experience papers in other fields has enabled us to propose a set of guidelines with solid foundations. Conclusion: Having been developed on the basis of three types of solid experience and evidence, the provided MLR guidelines support researchers in effectively and efficiently conducting new MLRs in any area of SE.
SE Dec 15, 2014
A Survey of Software Engineering Practices in Turkey (extended version)
Vahid Garousi, Ahmet Coşkunçay, Aysu Betin-Can et al.
Context: Understanding the types of software engineering practices and techniques used in the industry is important. There is a wide spectrum in terms of the types and maturity of software engineering practices conducted in each software team and company. To characterize the type of software engineering practices conducted in software firms, a variety of surveys have been conducted in different countries and regions. Turkey has a vibrant software industry and it is important to characterize and understand the state of software engineering practices in this industry. Objective: Our objective is to characterize and obtain a high-level view of the types of software engineering practices in the Turkish software industry. Among the software engineering practices that we have surveyed in this study are the following: software requirements, design, development, testing, maintenance, configuration management, release planning and support practices. The current survey is the most comprehensive of its type ever conducted in the context of the Turkish software industry. Method: To achieve the above objective, we systematically designed an online survey with 46 questions based on our past experience in the Canadian and Turkish contexts and using the Software Engineering Body of Knowledge (SWEBOK). 202 practicing software engineers from the Turkish software industry participated in the survey. We analyze and report in this paper the results of the questions. Whenever possible, we also compare the trends and results of our survey with the results of a similar 2010 survey conducted in the Canadian software industry.