SE · Sep 19, 2024
On the Effectiveness of LLMs for Manual Test Verifications
Myron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento et al.
Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications. However, the agreement level among professional testers was slightly above 40%, indicating both promise and room for improvement. While some LLM-generated verifications were considered better than the originals, there were also concerns about AI hallucinations, where verifications significantly deviated from expectations. Conclusion: We contributed by generating a dataset of 37,040 test verifications using 8 different LLMs. Although the models show potential, the relatively modest 40% agreement level highlights the need for further refinement. Enhancing the accuracy, relevance, and clarity of the generated verifications is crucial to ensure greater reliability in real-world testing scenarios.
SE · Nov 9, 2018
Influence of Technical and Social Factors for Introducing Bugs
Filipe Falcão, Caio Barbosa, Baldoino Fonseca et al.
[This paper has been withdrawn by the author due to updated research available on arXiv (arXiv:1811.01918)] As the modern open-source paradigm makes it easier to contribute to software projects, the number of developers involved in these projects keeps increasing. This growth in the number of developers makes it more difficult to deal with harmful contributions. Recent studies have found that technical and social factors can predict the success of contributions to open-source projects on GitHub. However, these studies do not examine the relation between these factors and the introduction of bugs. Our study investigates the influence of technical factors (such as developers' experience) and social factors (such as number of followers) on the introduction of bugs, using information from 14 projects hosted on GitHub. Understanding the influence of these factors may be useful to developers, code reviewers, and researchers. For instance, code reviewers may want to double-check commits from developers who exhibit bug-related factors. We found that technical factors have a consistent influence on the introduction of bugs. Social factors, on the other hand, show signs of influence on bug introduction that would require more data to be properly evaluated. Moreover, we found that the perils of mining GitHub may affect the results for these factors.
SE · Apr 30
Beyond Code, We Are People: A Systematic Mapping of 25 Years of Literature on Soft Skills in Agile Development Teams
Israely Lima, Lucas Moura Lourenço, Márcio Ribeiro et al.
Software development is a sociotechnical and human-centered endeavor in which human factors directly influence quality, productivity, and innovation capacity. In this context, career development in computing goes beyond technical mastery, requiring competencies that enable professionals to deal with continuous change and collaborative demands. Among these, non-technical skills (soft skills) stand out, encompassing social, emotional, and communicative dimensions essential to team effectiveness and the success of software projects. Despite their recognized importance, there is still a need for a systematic mapping of the most relevant soft skills over the past 25 years, a period marked by the adoption of agile approaches in industry. This gap limits the integration of human and technical aspects in software development. This study presents a systematic mapping of the literature, analyzing 97 studies published between January 2000 and May 2025 across major scientific databases. The results identify recurring competencies such as communication, adaptability, teamwork, and leadership, as well as their association with different roles in agile contexts. The main agile approaches adopted, particularly Scrum, are also identified, along with key gaps in the literature, such as the lack of studies on role-specific soft skills. The findings can support researchers, educators, and practitioners in designing curricula, training strategies, and organizational practices aligned with human factors, reinforcing the importance of integrating social and technical dimensions in the development of collaborative and innovative professionals.
CL · May 24, 2025
Assessing the Capability of LLMs in Solving POSCOMP Questions
Cayo Viegas, Rohit Gheyi, Márcio Ribeiro
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science and promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
CR · Sep 14, 2021
Exploring the Use of Static and Dynamic Analysis to Improve the Performance of the Mining Sandbox Approach for Android Malware Identification
Francisco Handrick da Costa, Ismael Medeiros, Thales Menezes et al.
The Android mining sandbox approach consists of running dynamic analysis tools on a benign version of an Android app and recording every call to sensitive APIs. Later, one can use this information to (a) prevent calls to other sensitive APIs (those not previously recorded) or (b) run the dynamic analysis tools again on a different version of the app in order to identify possible malicious behavior. Although the use of dynamic analysis for mining Android sandboxes has been empirically investigated before, little is known about the potential benefits of combining static analysis with the mining sandbox approach for identifying malicious behavior. As such, in this paper we present the results of two empirical studies. The first is a non-exact replication of previous research by Bao et al., which compares the performance of test case generation tools for mining Android sandboxes. The second is a new experiment to investigate the implications of using taint analysis algorithms to complement the mining sandbox approach in the task of identifying malicious behavior. Our study brings several findings. For instance, the first study reveals that a static analysis component of DroidFax (a tool used for instrumenting Android apps in the Bao et al. study) contributes substantially to the performance of the dynamic analysis tools explored in the previous work. The results of the second study show that taint analysis is also a practical complement to the mining sandbox approach, improving the performance of the latter strategy by up to 28.57%.
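The record-then-compare idea described in this abstract can be illustrated with a minimal Python sketch. It is not the tooling used in the paper: the function names `mine_sandbox` and `unexpected_calls` are hypothetical, traces are assumed to be plain lists of API name strings, and the API names in the usage lines are only illustrative labels.

```python
def mine_sandbox(benign_traces):
    """Collect every sensitive API call seen while exploring the benign version."""
    allowed = set()
    for trace in benign_traces:
        allowed.update(trace)
    return frozenset(allowed)


def unexpected_calls(sandbox, observed_trace):
    """Return calls absent from the mined sandbox, in first-seen order, without repeats."""
    seen = set()
    flagged = []
    for call in observed_trace:
        if call not in sandbox and call not in seen:
            seen.add(call)
            flagged.append(call)
    return flagged


# Hypothetical traces: two exploration runs of the benign version,
# then one run of a different version of the same app.
benign_runs = [
    ["LocationManager.getLastKnownLocation", "ConnectivityManager.getActiveNetworkInfo"],
    ["ConnectivityManager.getActiveNetworkInfo"],
]
other_version_run = [
    "LocationManager.getLastKnownLocation",
    "SmsManager.sendTextMessage",
]

sandbox = mine_sandbox(benign_runs)
print(unexpected_calls(sandbox, other_version_run))
```

Any call reported by `unexpected_calls` corresponds to use (b) in the abstract: a sensitive API that the benign version never exercised during exploration. Blocking such calls at run time would correspond to use (a).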
SE · Jul 13, 2021
What Evidence We Would Miss If We Do Not Use Grey Literature?
Fernando Kamei, Gustavo Pinto, Igor Wiese et al.
Context: In recent years, Grey Literature (GL) has been gaining increasing attention in Secondary Studies in Software Engineering (SE). Notably, Multivocal Literature Review (MLR) studies, which search for evidence in both Traditional Literature (TL) and GL, are particularly benefiting from this rise of GL content. Despite the growing interest in MLR-based studies, the literature assessing how GL has contributed to MLR studies is still scarce. Objective: This research aims to assess how the use of GL contributed to MLR studies. By contributing, we mean understanding to what extent GL provides evidence that is actually used by an MLR to answer its research question. Method: We conducted a tertiary study to identify MLR studies published between 2017 and 2019, selecting nine MLR studies. Using qualitative and quantitative analysis, we identified the GL used and assessed to what extent this GL contributes to the MLR studies. Results: Our analysis identified that 1) GL provided evidence not found in TL, 2) most of the GL sources were used to provide recommendations to solve problems, explain a topic, and classify the findings, and 3) 19 different GL types were used in the studies; this GL was mainly produced by SE practitioners (including blog posts, slide presentations, or project descriptions). Conclusions: We provide evidence of how GL contributed to MLR studies. We observed that if this GL had not been included in the MLRs, several findings would have been omitted or weakened. We also describe the challenges involved in conducting this investigation, along with potential ways to deal with them, which may help future SE researchers.
SE · Apr 27, 2021
Grey Literature in Software Engineering: A Critical Review
Fernando Kamei, Igor Wiese, Crescencio Lima et al.
Context: The use of Grey Literature (GL) in Software Engineering (SE) research has grown recently with the increased use of online communication channels by software engineers. However, there is still a limited understanding of how SE research is taking advantage of GL. Objective: This research aimed to understand how SE researchers use GL in their secondary studies. Method: We conducted a tertiary study of studies published between 2011 and 2018 in high-quality software engineering conferences and journals. We then applied qualitative and quantitative analysis to investigate 446 potential studies. Results: Of the 446 selected studies, 126 cited GL, but only 95 of those used GL to answer a specific research question, representing about 21% of all 446 secondary studies. Interestingly, we identified that few studies employed specific search mechanisms and used additional criteria for assessing GL. Moreover, by the time we conducted this research, 49% of the GL URLs were no longer working. Based on our findings, we discuss some challenges in using GL and potential mitigation plans. Conclusion: In this paper, we summarized the last 10 years of software engineering research that uses GL, showing that GL has been essential for bringing practical new perspectives that are scarce in traditional literature. By mapping the current landscape of use, we also raise awareness of related challenges (and strategies to deal with them).
SE · Sep 13, 2020
On the Use of Grey Literature: A Survey with the Brazilian Software Engineering Research Community
Fernando Kamei, Igor Wiese, Gustavo Pinto et al.
Background: The use of Grey Literature (GL) has been investigated in diverse research areas. In Software Engineering (SE), interest in this topic has increased in recent years. Problem: Even with the increase of GL published in diverse sources, the understanding of its use in the SE research community is still controversial. Objective: To understand how Brazilian SE researchers use GL, we aimed to identify the criteria they use to assess its credibility, as well as the benefits and challenges of its use. Method: We surveyed 76 active SE researchers, participants in a flagship SE conference in Brazil, using a questionnaire with 11 questions to share their views on the use of GL in the context of SE research. We followed a qualitative approach to analyze open questions. Results: We found that most surveyed researchers use GL mainly to understand new topics. Our work identified new findings, including: 1) GL sources used by SE researchers (e.g., blogs, community websites); 2) motivations to use GL (e.g., to understand problems and to complement research findings) or reasons to avoid it (e.g., lack of reliability, lack of scientific value); 3) the benefit that GL is easy to access and read, and the challenge of having its scientific value recognized; and 4) criteria to assess GL credibility, showing the importance of the content owner being renowned (e.g., renowned authors and institutions). Conclusions: Our findings contribute to forming a body of knowledge on the use of GL by SE researchers, by discussing novel (some contradictory) results and providing a set of lessons learned to both SE researchers and practitioners.
SE · Jun 26, 2019
Software Engineering Research Community Viewpoints on Rapid Reviews
Bruno Cartaxo, Gustavo Pinto, Baldoino Fonseca et al.
Background: One of the most important current challenges of Software Engineering (SE) research is to provide relevant evidence to practice. In health-related fields, Rapid Reviews (RRs) have been shown to be an effective method to achieve that goal. However, little is known about how the SE research community perceives the potential applicability of RRs. Aims: The goal of this study is to understand the SE research community's viewpoints towards the use of RRs as a means to provide evidence to practitioners. Method: To understand their viewpoints, we invited 37 researchers to analyze 50 opinion statements about RRs and rate them according to the extent to which they agree with each statement. Q-Methodology was employed to identify the most salient viewpoints, represented by the so-called factors. Results: Four factors were identified: Factor A groups undecided researchers who need more evidence before using RRs; researchers grouped in Factor B are generally positive about RRs, but highlight the need to define minimum standards; Factor C researchers are more skeptical and reinforce the importance of high-quality evidence; researchers aligned with Factor D have a pragmatic point of view, considering that RRs can be applied based on the context and constraints faced by practitioners. Conclusions: Although there are opposing viewpoints, there is also some common ground. For example, all viewpoints agree that both RRs and Systematic Reviews can be poorly or well conducted.
SE · Nov 5, 2018
On Relating Technical, Social Factors, and the Introduction of Bugs
Filipe Falcão, Caio Barbosa, Baldoino Fonseca et al.
As collaborative coding environments make it easier to contribute to software projects, the number of developers involved in these projects keeps increasing. This increase makes it more difficult for code reviewers to deal with buggy contributions. Collaborative environments like GitHub provide a rich source of data on developers' contributions. Such data can be used to extract information about developers regarding technical (e.g., their experience) and social (e.g., their interactions) factors. Recent studies analyzed the influence of these factors on different activities of software development. However, the relation between these factors and the introduction of bugs is still not well understood. We present a broader study, including 8 projects from different domains and 6,537 bug reports, relating five technical factors and three social factors to the introduction of bugs. The results indicate that technical and social factors can discriminate between buggy and clean commits. However, the technical factors are more decisive than the social ones. In particular, developers' habits of not following technical contribution norms and the developer's commit bugginess are associated with an increase in commit bugginess. On the other hand, the project's establishment, the ownership level of developers' commits, and social influence are related to a lower chance of introducing bugs.
SE · Feb 5, 2016
A Comparison of 10 Sampling Algorithms for Configurable Systems
Flávio Medeiros, Christian Kästner, Márcio Ribeiro et al.
Almost every software system provides configuration options to tailor the system to the target platform and application scenario. Often, this configurability renders the analysis of every individual system configuration infeasible. To address this problem, researchers have proposed a diverse set of sampling algorithms. We present a comparative study of 10 state-of-the-art sampling algorithms regarding their fault-detection capability and the size of their sample sets. The former is important to improve software quality and the latter to reduce the time of analysis. In a nutshell, we found that sampling algorithms with larger sample sets are able to detect higher numbers of faults, but simple algorithms with small sample sets, such as most-enabled-disabled, are the most efficient in most contexts. Furthermore, we observed that the limiting assumptions made in previous work influence the number of detected faults, the size of sample sets, and the ranking of algorithms. Finally, we have identified a number of technical challenges when trying to avoid the limiting assumptions, which call into question the practicality of certain sampling algorithms.
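To make the trade-off between sample size and fault detection concrete, here is a small Python sketch of the most-enabled-disabled idea under a strong simplifying assumption: options are independent booleans with no constraints among them, so the two sampled configurations are simply "everything on" and "everything off". The option names and the fault model (a fault as a partial assignment of options) are hypothetical and are not taken from the paper's experimental setup.

```python
def most_enabled_disabled(options):
    """Sample two configurations: all options enabled, then all options disabled.

    Assumes unconstrained boolean options (a simplification).
    """
    return [
        {name: True for name in options},
        {name: False for name in options},
    ]


def detects(sample, fault):
    """A fault is modeled as a partial assignment of options.

    The sample detects it if some configuration agrees with every listed option.
    """
    return any(
        all(config[name] == value for name, value in fault.items())
        for config in sample
    )


options = ["LOGGING", "SSL", "CACHE"]  # hypothetical option names
sample = most_enabled_disabled(options)

print(len(sample), "sampled configurations versus", 2 ** len(options), "in total")
print(detects(sample, {"LOGGING": True, "SSL": True}))   # both enabled
print(detects(sample, {"LOGGING": True, "SSL": False}))  # mixed setting
```

Under this model the sample stays at two configurations regardless of the number of options, but a fault that needs one option on and another off is missed, which is the kind of gap that larger sample sets are meant to close.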