SEJun 2
Analyzing the Evolution of Structural Communities within Microservice ArchitectureAlexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi et al.
In recent years, the detection of anti-patterns in microservice architecture has gained traction, particularly to identify instances of Microservice Architectural Degradation. In such tasks, the microservice architecture is often modeled as a network of microservice dependencies. Recent works have explored how to assess the evolution of such architectural networks by considering the architecture of consecutive releases of the project. Particular anti-patterns related to the structure of the service network include Wrong cuts and Knot services. Community detection is a way to identify groups of services in a network that strongly depend on each other. If such groups cannot be mapped to business processes in the system, or if the same service belongs to multiple communities, this could indicate architectural degradation due to an inappropriate division of responsibilities or unoptimized communication. Temporal community detection methods have been proposed to analyze community structure that evolves in time. We performed temporal community detection within the microservice architecture of six releases of the train-ticket benchmark and analyzed the composition of the discovered communities and their activities over time. We observed a stable architecture with a clear separation of services into two communities, which we could identify with two business processes performed by the system. We found services belonging to several communities, as well as services within the same community with both incoming and outgoing connections. The membership strength metric provided by the leveraged algorithm enables fine-grained assessment of the microservice communities.
SEJan 5Code
The Invisible Hand of AI Libraries Shaping Open Source Projects and CommunitiesMatteo Esposito, Andrea Janes, Valentina Lenarduzzi et al.
In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software. What began as a revolutionary idea has now become the cornerstone of computer science. Amidst OSS projects, AI is increasing its presence and relevance. However, despite the growing popularity of AI, its adoption and impacts on OSS projects remain underexplored. We aim to assess the adoption of AI libraries in Python and Java OSS projects and examine how they shape development, including the technical ecosystem and community engagement. To this end, we will perform a large-scale analysis on 157.7k potential OSS repositories, employing repository metrics and software metrics to compare projects adopting AI libraries against those that do not. We expect to identify measurable differences in development activity, community engagement, and code complexity between OSS projects that adopt AI libraries and those that do not, offering evidence-based insights into how AI integration reshapes software development practices.
SENov 14, 2025Code
SQuaD: The Software Quality DatasetMikel Robredo, Matteo Esposito, Davide Taibi et al.
Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. The SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on ZENODO (DOI: 10.5281/zenodo.17566690).
SEMar 23, 2022
What is Software Quality for AI Engineers? Towards a Thinning of the FogValentina Golendukhina, Valentina Lenarduzzi, Michael Felderer
It is often overseen that AI-enabled systems are also software systems and therefore rely on software quality assurance (SQA). Thus, the goal of this study is to investigate the software quality assurance strategies adopted during the development, integration, and maintenance of AI/ML components and code. We conducted semi-structured interviews with representatives of ten Austrian SMEs that develop AI-enabled systems. A qualitative analysis of the interview data identified 12 issues in the development of AI/ML components. Furthermore, we identified when quality issues arise in AI/ML components and how they are detected. The results of this study should guide future work on software quality assurance processes and techniques for AI/ML components.
SEJul 8, 2024
6GSoft: Software for Edge-to-Cloud ContinuumMuhammad Azeem Akbar, Matteo Esposito, Sami Hyrynsalmi et al.
In the era of 6G, developing and managing software requires cutting-edge software engineering (SE) theories and practices tailored for such complexity across a vast number of connected edge devices. Our project aims to lead the development of sustainable methods and energy-efficient orchestration models specifically for edge environments, enhancing architectural support driven by AI for contemporary edge-to-cloud continuum computing. This initiative seeks to position Finland at the forefront of the 6G landscape, focusing on sophisticated edge orchestration and robust software architectures to optimize the performance and scalability of edge networks. Collaborating with leading Finnish universities and companies, the project emphasizes deep industry-academia collaboration and international expertise to address critical challenges in edge orchestration and software architecture, aiming to drive significant advancements in software productivity and market impact.
SEJan 5
A Defect is Being Born: How Close Are We? A Time Sensitive Forecasting ApproachMikel Robredo, Matteo Esposito, Fabio Palomba et al.
Background. Defect prediction has been a highly active topic among researchers in the Empirical Software Engineering field. Previous literature has successfully achieved the most accurate prediction of an incoming fault and identified the features and anomalies that precede it through just-in-time prediction. As software systems evolve continuously, there is a growing need for time-sensitive methods capable of forecasting defects before they manifest. Aim. Our study seeks to explore the effectiveness of time-sensitive techniques for defect forecasting. Moreover, we aim to investigate the early indicators that precede the occurrence of a defect. Method. We will train multiple time-sensitive forecasting techniques to forecast the future bug density of a software project, as well as identify the early symptoms preceding the occurrence of a defect. Expected results. Our expected results are translated into empirical evidence on the effectiveness of our approach for early estimation of bug proneness.
SEFeb 5
Sovereign-by-Design A Reference Architecture for AI and Blockchain Enabled SystemsMatteo Esposito, Lodovica Marchesi, Roberto Tonelli et al.
Digital sovereignty has emerged as a central concern for modern software-intensive systems, driven by the dominance of non-sovereign cloud infrastructures, the rapid adoption of Generative AI, and increasingly stringent regulatory requirements. While existing initiatives address governance, compliance, and security in isolation, they provide limited guidance on how sovereignty can be operationalized at the architectural level. In this paper, we argue that sovereignty must be treated as a first-class architectural property rather than a purely regulatory objective. We introduce a Sovereign Reference Architecture that integrates self-sovereign identity, blockchain-based trust and auditability, sovereign data governance, and Generative AI deployed under explicit architectural control. The architecture explicitly captures the dual role of Generative AI as both a source of governance risk and an enabler of compliance, accountability, and continuous assurance when properly constrained. By framing sovereignty as an architectural quality attribute, our work bridges regulatory intent and concrete system design, offering a coherent foundation for building auditable, evolvable, and jurisdiction-aware AI-enabled systems. The proposed reference architecture provides a principled starting point for future research and practice at the intersection of software architecture, Generative AI, and digital sovereignty.
SEMay 12
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE SeminarDavide Taibi, Henry Muccini, Karthik Vaidhyanathan et al.
The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.
SENov 3, 2025Code
Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical DebtMurali Sridharan, Mikel Robredo, Leevi Rantala et al.
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.
SEDec 30, 2021Code
An Empirical Study of Security Practices for Microservices SystemsAli Rezaei Nasab, Mojtaba Shahin, Seyed Ali Hoseyni Raviz et al.
Despite the numerous benefits of microservices systems, security has been a critical issue in such systems. Several factors explain this difficulty, including a knowledge gap among microservices practitioners on properly securing a microservices system. To (partially) bridge this gap, we conducted an empirical study. We first manually analyzed 861 microservices security points, including 567 issues, 9 documents, and 3 wiki pages from 10 GitHub open-source microservices systems and 306 Stack Overflow posts concerning security in microservices systems. In this study, a microservices security point is referred to as "a GitHub issue, a Stack Overflow post, a document, or a wiki page that entails 5 or more microservices security paragraphs". Our analysis led to a catalog of 28 microservices security practices. We then ran a survey with 74 microservices practitioners to evaluate the usefulness of these 28 practices. Our findings demonstrate that the survey respondents affirmed the usefulness of the 28 practices. We believe that the catalog of microservices security practices can serve as a valuable resource for microservices practitioners to more effectively address security issues in microservices systems. It can also inform the research community of the required or less explored areas to develop microservices-specific security practices and tools.
SEAug 25, 2019Code
Does Code Quality Affect Pull Request Acceptance? An empirical studyValentina Lenarduzzi, Vili Nikkola, Nyyti Saarimäki et al.
Background. Pull requests are a common practice for contributing and reviewing contributions, and are employed both in open-source and industrial contexts. One of the main goals of code reviews is to find defects in the code, allowing project maintainers to easily integrate external contributions into a project and discuss the code contributions. Objective. The goal of this paper is to understand whether code quality is actually considered when pull requests are accepted. Specifically, we aim at understanding whether code quality issues such as code smells, antipatterns, and coding style violations in the pull request code affect the chance of its acceptance when reviewed by a maintainer of the project. Method. We conducted a case study among 28 Java open-source projects, analyzing the presence of 4.7 M code quality issues in 36 K pull requests. We analyzed further correlations by applying Logistic Regression and seven machine learning techniques (Decision Tree, Random Forest, Extremely Randomized Trees, AdaBoost, Gradient Boosting, XGBoost). Results. Unexpectedly, code quality turned out not to affect the acceptance of a pull request at all. As suggested by other works, other factors such as the reputation of the maintainer and the importance of the feature delivered might be more important than code quality in terms of pull request acceptance. Conclusions. Researchers already investigated the influence of the developers' reputation and the pull request acceptance. This is the first work investigating if quality of the code in pull requests affects the acceptance of the pull request or not. We recommend that researchers further investigate this topic to understand if different measures or different tools could provide some useful measures.
SEJun 30, 2019Code
Are SonarQube Rules Inducing Bugs?Valentina Lenarduzzi, Francesco Lomio, Heikki Huttunen et al.
Background. The popularity of tools for analyzing Technical Debt, and particularly the popularity of SonarQube, is increasing rapidly. SonarQube proposes a set of coding rules, which represent something wrong in the code that will soon be reflected in a fault or will increase maintenance effort. However, our local companies were not confident in the usefulness of the rules proposed by SonarQube and contracted us to investigate the fault-proneness of these rules. Objective. In this work we aim at understanding which SonarQube rules are actually fault-prone and to understand which machine learning models can be adopted to accurately identify fault-prone rules. Method. We designed and conducted an empirical study on 21 well-known mature open-source projects. We applied the SZZ algorithm to label the fault-inducing commits. We analyzed the fault-proneness by comparing the classification power of seven machine learning models. Result. Among the 202 rules defined for Java by SonarQube, only 25 can be considered to have relatively low fault-proneness. Moreover, violations considered as "bugs" by SonarQube were generally not fault-prone and, consequently, the fault-prediction power of the model proposed by SonarQube is extremely low. Conclusion. The rules applied by SonarQube for calculating technical debt should be thoroughly investigated and their harmfulness needs to be further confirmed. Therefore, companies should carefully consider which rules they really need to apply, especially if their goal is to reduce fault-proneness.
CRDec 16, 2024
On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?Matteo Esposito, Francesco Palagiano, Valentina Lenarduzzi et al.
Context. The security of critical infrastructure has been a pressing concern since the advent of computers and has become even more critical in today's era of cyber warfare. Protecting mission-critical systems (MCSs), essential for national security, requires swift and robust governance, yet recent events reveal the increasing difficulty of meeting these challenges. Aim. Building on prior research showcasing the potential of Generative AI (GAI), such as Large Language Models, in enhancing risk analysis, we aim to explore practitioners' views on integrating GAI into the governance of IT MCSs. Our goal is to provide actionable insights and recommendations for stakeholders, including researchers, practitioners, and policymakers. Method. We designed a survey to collect practical experiences, concerns, and expectations of practitioners who develop and implement security solutions in the context of MCSs. Conclusions and Future Works. Our findings highlight that the safe use of LLMs in MCS governance requires interdisciplinary collaboration. Researchers should focus on designing regulation-oriented models and focus on accountability; practitioners emphasize data protection and transparency, while policymakers must establish a unified AI framework with global benchmarks to ensure ethical and secure LLMs-based MCS governance.
SEMar 17, 2025
Generative AI for Software Architecture. Applications, Challenges, and Future DirectionsMatteo Esposito, Xiaozhou Li, Sergio Moreschini et al.
Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieved-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.
SEJan 22, 2025
A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software EngineeringMatteo Esposito, Mikel Robredo, Murali Sridharan et al.
Context: Empirical Software Engineering (ESE) drives innovation in SE through qualitative and quantitative studies. However, concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on SE. Objective: To analyze three decades of SE research, identify mistakes in statistical methods, and evaluate experts' ability to detect and address these issues. Methods: We conducted a literature survey of ~27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate. Additionally, we selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues. Results: Significant statistical issues were found in the primary studies, and experts showed limited ability to detect and correct these methodological problems, raising concerns about the broader ESE community's proficiency in this area. Conclusions. Despite our study's eventual limitations, its results shed light on recurring issues from promoting information copy-and-paste from past authors' works and the continuous publication of inadequate approaches that promote dubious results and jeopardize the spread of the correct statistical strategies among researchers. Besides, it justifies further investigation into empirical rigor in software engineering to expose these recurring issues and establish a framework for reassessing our field's foundation of statistical methodology application. Therefore, this work calls for critically rethinking and reforming data analysis in empirical software engineering, paving the way for our work soon.
SEJun 27, 2025
Autonomic Microservice Management via Agentic AI and MAPE-K IntegrationMatteo Esposito, Alexander Bakhtin, Noman Ahmad et al.
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
SESep 9, 2025
What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source ProjectsMikel Robredo, Matteo Esposito, Fabio Palomba et al.
Context. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Aim. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. Method. We performed a large-scale empirical study to analyze developers refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. Results. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. Conclusions. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.
SEMay 2, 2025
Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)Phuoc Pham, Murali Sridharan, Matteo Esposito et al.
In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances ``openly admitted'' by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
CLJun 11, 2024
Beyond Words: On Large Language Models Actionability in Mission-Critical Risk AnalysisMatteo Esposito, Francesco Palagiano, Valentina Lenarduzzi et al.
Context. Risk analysis assesses potential risks in specific scenarios. Risk analysis principles are context-less; the same methodology can be applied to a risk connected to health and information technology security. Risk analysis requires a vast knowledge of national and international regulations and standards and is time and effort-intensive. A large language model can quickly summarize information in less time than a human and can be fine-tuned to specific tasks. Aim. Our empirical study aims to investigate the effectiveness of Retrieval-Augmented Generation and fine-tuned LLM in risk analysis. To our knowledge, no prior study has explored its capabilities in risk analysis. Method. We manually curated 193 unique scenarios leading to 1283 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years. We compared the base GPT-3.5 and GPT-4 models versus their Retrieval-Augmented Generation and fine-tuned counterparts. We employ two human experts as competitors of the models and three other human experts to review the models and the former human experts' analysis. The reviewers analyzed 5,000 scenario analyses. Results and Conclusions. Human experts demonstrated higher accuracy, but LLMs are quicker and more actionable. Moreover, our findings show that RAG-assisted LLMs have the lowest hallucination rates, effectively uncovering hidden risks and complementing human expertise. Thus, the choice of model depends on specific needs, with FTMs for accuracy, RAG for hidden risks discovery, and base models for comprehensiveness and actionability. Therefore, experts can leverage LLMs as an effective complementing companion in risk analysis within a condensed timeframe. They can also save costs by averting unnecessary expenses associated with implementing unwarranted countermeasures.
SEAug 27, 2021
Towards a Methodology for Participant Selection in Software Engineering Experiments. A Vision of the FutureValentina Lenarduzzi, Oscar Dieste, Davide Fucci et al.
Background. Software Engineering (SE) researchers extensively perform experiments with human subjects. Well-defined samples are required to ensure external validity. Samples are selected \textit{purposely} or by \textit{convenience}, limiting the generalizability of results. Objective. We aim to depict the current status of participants selection in empirical SE, identifying the main threats and how they are mitigated. We draft a robust approach to participants' selection. Method. We reviewed existing participants' selection guidelines in SE, and performed a preliminary literature review to find out how participants' selection is conducted in SE in practice. % and 3) we summarized the main issues identified. Results. We outline a new selection methodology, by 1) defining the characteristics of the desired population, 2) locating possible sources of sampling available for researchers, and 3) identifying and reducing the "distance" between the selected sample and its corresponding population. Conclusion. We propose a roadmap to develop and empirically validate the selection methodology.
SEMay 29, 2021
Identification and Measurement of Technical Debt Requirements in Software Development: a Systematic Literature ReviewAna Melo, Roberta Fagundes, Valentina Lenarduzzi et al.
Context: Technical Debt requirements are related to the distance between the ideal value of the specification and the system's actual implementation, which are consequences of strategic decisions for immediate gains, or unintended changes in context. To ensure the evolution of the software, it is necessary to keep it managed. Identification and measurement are the first two stages of the management process; however, they are little explored in academic research in requirements engineering. Objective: We aimed at investigating which evidence helps to strengthen the process of TD requirements management, including identification and measurement. Method: We conducted a Systematic Literature Review through manual and automatic searches considering 7499 studies from 2010 to 2020, and including 61 primary studies. Results: We identified some causes related to Technical Debt requirements, existing strategies to help in the identification and measurement, and metrics to support the measurement stage. Conclusion: Studies on TD requirements are still preliminary, especially on management tools. Yet, not enough attention is given to interpersonal issues, which are difficulties encountered when performing such activities, and therefore also require research. Finally, the provision of metrics to help measure TD is part of this work's contribution, providing insights into the application in the requirements context.
SEMar 21, 2021
Fault Prediction based on Software Metrics and SonarQube Rules. Machine or Deep Learning?Francesco Lomio, Sergio Moreschini, Valentina Lenarduzzi
Background. Developers spend more time fixing bugs and refactoring the code to increase the maintainability than developing new features. Researchers investigated the code quality impact on fault-proneness focusing on code smells and code metrics. Objective. We aim at advancing fault-inducing commit prediction based on SonarQube considering the contribution provided by each rule and metric. Method. We designed and conducted a case study among 33 Java projects analyzed with SonarQube and SZZ to identify fault-inducing and fault-fixing commits. Moreover, we investigated fault-proneness of each SonarQube rule and metric using Machine and Deep Learning models. Results. We analyzed 77,932 commits that contain 40,890 faults and infected by more than 174 SonarQube rules violated 1,9M times, on which there was calculated 24 software metrics available by the tool. Compared to machine learning models, deep learning provide a more accurate fault detection accuracy and allowed us to accurately identify the fault-prediction power of each SonarQube rule. As a result, fourteen of the 174 violated rules has an importance higher than 1\% and account for 30\% of the total fault-proneness importance, while the fault proneness of the remaining 165 rules is negligible. Conclusion. Future works might consider the adoption of timeseries analysis and anomaly detection techniques to better and more accurately detect the rules that impact fault-proneness.
SEJan 21, 2021
A Critical Comparison on Six Static Analysis Tools: Detection, Agreement, and PrecisionValentina Lenarduzzi, Savanna Lujan, Nyyti Saarimaki et al.
Background. Developers use Automated Static Analysis Tools (ASATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on ASATs to favor the understanding of their actual capabilities. Aims. Despite the work done so far, there is still a lack of knowledge regarding (1) which source quality problems can actually be detected by static analysis tool warnings, (2) what is their agreement, and (3) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular static analysis tools for Java projects: Better Code Hub, CheckStyle, Coverity Scan, Findbugs, PMD, and SonarQube. Method. We analyze 47 Java projects and derive a taxonomy of warnings raised by 6 state-of-the-practice ASATs. To assess their agreement, we compared them by manually analyzing - at line-level - whether they identify the same issues. Finally, we manually evaluate the precision of the tools. Results. The key results report a comprehensive taxonomy of ASATs warnings, show little to no agreement among the tools and a low degree of precision. Conclusions. We provide a taxonomy that can be useful to researchers, practitioners, and tool vendors to map the current capabilities of the tools. Furthermore, our study provides the first overview on the agreement among different tools as well as an extensive analysis of their precision.
SENov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing CommitsSteffen Herbold, Alexander Trautsch, Benjamin Ledel et al.
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
SEOct 7, 2020
Empirical Standards for Software Engineering ResearchPaul Ralph, Nauman bin Ali, Sebastian Baltes et al.
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.
SESep 19, 2019
From Monolithic Systems to Microservices: An Assessment FrameworkFlorian Auer, Valentina Lenarduzzi, Michael Felderer et al.
Context. Re-architecting monolithic systems with Microservices-based architecture is a common trend. Various companies are migrating to Microservices for different reasons. However, making such an important decision like re-architecting an entire system must be based on real facts and not only on gut feelings. Objective. The goal of this work is to propose an evidence-based decision support framework for companies that need to migrate to Microservices, based on the analysis of a set of characteristics and metrics they should collect before re-architecting their monolithic system. Method. We designed this study with a mixed-methods approach combining a Systematic Mapping Study with a survey done in the form of interviews with professionals to derive the assessment framework based on Grounded Theory. Results. We identified a set consisting of information and metrics that companies can use to decide whether to migrate to Microservices or not. The proposed assessment framework, based on the aforementioned metrics, could be useful for companies if they need to migrate to Microservices and do not want to run the risk of failing to consider some important information.
SEAug 30, 2019
Some SonarQube Issues have a Significant but SmallEffect on Faults and Changes. A large-scale empirical studyValentina Lenarduzzi, Nyyti Saarimäki, Davide Taibi
Context. Companies commonly invest effort to remove technical issues believed to impact software qualities, such as removing anti-patterns or coding styles violations. Objective. Our aim is to analyze the diffuseness of Technical Debt (TD) items in software systems and to assess their impact on code changes and fault-proneness, considering also the type of TD items and their severity. Method. We conducted a case study among 33 Java projects from the Apache Software Foundation (ASF) repository. We analyzed 726 commits containing 27K faults and 12M changes. The projects violated 173 SonarQube rules generating more than 95K TD items in more than 200K classes. Results. Clean classes (classes not affected by TD items) are less change-prone than dirty ones, but the difference between the groups is small. Clean classes are slightly more change-prone than classes affected by TD items of type Code Smell or Security Vulnerability. As for fault-proneness, there is no difference between clean and dirty classes. Moreover, we found a lot of incongruities in the type and severity level assigned by SonarQube. Conclusions. Our result can be useful for practitioners to understand which TD items they should refactor and for researchers to bridge the missing gaps. They can also support companies and tool vendors in identifying TD items as accurately as possible.
SEAug 27, 2019
Continuous Architecting with Microservices and DevOps: A Systematic Mapping StudyDavide Taibi, Valentina Lenarduzzi, Claus Pahl
Context: Several companies are migrating their information systems into the Cloud. Microservices and DevOps are two of the most common adopted technologies. However, there is still a lack of understanding how to adopt a microservice-based architectural style and which tools and technique to use in a continuous architecting pipeline. Objective: We aim at characterizing the different microservice architectural style principles and patterns in order to map existing tools and techniques adopted in the context of DevOps. Methodology: We conducted a Systematic Mapping Study identifying the goal and the research questions, the bibliographic sources, the search strings, and the selection criteria to retrieve the most relevant papers. Results: We identified several agreed microservice architectural principles and patterns widely adopted and reported in 23 case studies, together with a summary of the advantages, disadvantages, and lessons learned for each pattern from the case studies. Finally, we mapped the existing microservices-specific techniques in order to understand how to continuously deliver value in a DevOps pipeline. We depicted the current research, reporting gaps and trends. Conclusion: Different patterns emerge for different migration, orchestration, storage and deployment settings. The results also show the lack of empirical work on microservices-specific techniques, especially for the release phase in DevOps.
SEAug 12, 2019
Microservices Anti Patterns: A TaxonomyDavide Taibi, Valentina Lenarduzzi, Claus Pahl
Several companies are re-architecting their monolithic information systems with microservices. However, many companies migrated without experience on microservices, mainly learning how to migrate from books or from practitioners' blogs. Because of the novelty of the topic, practitioners and consultancy are learning by doing how to migrate, thus facing several issues but also several benefits. In this chapter, we introduce a catalog and a taxonomy of the most common microservices anti-patterns in order to identify common problems. Our anti-pattern catalogue is based on the experience summarized by different practitioners we interviewed in the last three years. We identified a taxonomy of 20 anti-patterns, including organizational (team oriented and technology/tool oriented) anti-patterns and technical (internal and communication) anti-patterns. The results can be useful to practitioners to avoid experiencing the same difficult situations in the systems they develop. Moreover, researchers can benefit of this catalog and further validate the harmfulness of the anti-patterns identified.
SEAug 5, 2019
An Empirical Study on Technical Debt in a Finnish SMEValentina Lenarduzzi, Teemu Orava, Nyyti Saarimäki et al.
Objective. In this work, we report the experience of a Finnish SME in managing Technical Debt (TD), investigating the most common types of TD they faced in the past, their causes, and their effects. Method. We set up a focus group in the case-company, involving different roles. Results. The results showed that the most significant TD in the company stems from disagreements with the supplier and lack of test automation. Specification and test TD are the most significant types of TD. Budget and time constraints were identified as the most important root causes of TD. Conclusion. TD occurs when time or budget is limited or the amount of work are not understood properly. However, not all postponed activities generated "debt". Sometimes the accumulation of TD helped meet deadlines without a major impact, while in other cases the cost for repaying the TD was much higher than the benefits. From this study, we learned that learning, careful estimations, and continuous improvement could be good strategies to mitigate TD. These strategies include iterative validation with customers, efficient communication with stakeholders, meta-cognition in estimations, and value orientation in budgeting and scheduling.
SEAug 2, 2019
The Technical Debt DatasetValentina Lenarduzzi, Nyyti Saarimäki, Davide Taibi
Technical Debt analysis is increasing in popularity as nowadays researchers and industry are adopting various tools for static code analysis to evaluate the quality of their code. Despite this, empirical studies on software projects are expensive because of the time needed to analyze the projects. In addition, the results are difficult to compare as studies commonly consider different projects. In this work, we propose the Technical Debt Dataset, a curated set of project measurement data from 33 Java projects from the Apache Software Foundation. In the Technical Debt Dataset, we analyzed all commits from separately defined time frames with SonarQube to collect Technical Debt information and with Ptidej to detect code smells. Moreover, we extracted all available commit information from the git logs, the refactoring applied with Refactoring Miner, and fault information reported in the issue trackers (Jira). Using this information, we executed the SZZ algorithm to identify the fault-inducing and -fixing commits. We analyzed 78K commits from the selected 33 projects, detecting 1.8M SonarQube issues, 38K code smells, 28K faults and 57K refactorings. The project analysis took more than 200 days. In this paper, we describe the data retrieval pipeline together with the tools used for the analysis. The dataset is made available through CSV files and an SQLite database to facilitate queries on the data. The Technical Debt Dataset aims to open up diverse opportunities for Technical Debt research, enabling researchers to compare results on common projects.
SEAug 2, 2019
Towards Surgically-Precise Technical Debt Estimation: Early Results and Research RoadmapValentina Lenarduzzi, Antonio Martini, Davide Taibi et al.
The concept of technical debt has been explored from many perspectives but its precise estimation is still under heavy empirical and experimental inquiry. We aim to understand whether, by harnessing approximate, data-driven, machine-learning approaches it is possible to improve the current techniques for technical debt estimation, as represented by a top industry quality analysis tool such as SonarQube. For the sake of simplicity, we focus on relatively simple regression modelling techniques and apply them to modelling the additional project cost connected to the sub-optimal conditions existing in the projects under study. Our results shows that current techniques can be improved towards a more precise estimation of technical debt and the case study shows promising results towards the identification of more accurate estimation of technical debt.
SEJul 25, 2019
Towards an Holistic Definition of Requirements DebtValentina Lenarduzzi, Davide Fucci
When not appropriately managed, technical debt is considered to have negative effects on the long term success of a software project. However, how the debt metaphor applies to requirements engineering in general, and to requirements engineering activities in particular, is not well understood. Grounded in the existing literature, we present a holistic definition of requirements debt which include debt incurred during the identification, formalization, and implementation of requirements. We outline future assessment to validate and further refine our proposed definition. This conceptualization is the first step towards a requirements debt monitoring framework to support stakeholders decisions, such as when to incur and eventually pay back requirements debt, and at what costs
SEApr 29, 2019
Technical Debt Prioritization: State of the Art. A Systematic Literature ReviewValentina Lenarduzzi, Terese Besker, Davide Taibi et al.
Background. Software companies need to manage and refactor Technical Debt issues. Therefore, it is necessary to understand if and when refactoring Technical Debt should be prioritized with respect to developing features or fixing bugs. Objective. The goal of this study is to investigate the existing body of knowledge in software engineering to understand what Technical Debt prioritization approaches have been proposed in research and industry. Method. We conducted a Systematic Literature Review among 384 unique papers published until 2018, following a consolidated methodology applied in Software Engineering. We included 38 primary studies. Results. Different approaches have been proposed for Technical Debt prioritization, all having different goals and optimizing on different criteria. The proposed measures capture only a small part of the plethora of factors used to prioritize Technical Debt qualitatively in practice. We report an impact map of such factors. However, there is a lack of empirical and validated set of tools. Conclusion. We observed that technical Debt prioritization research is preliminary and there is no consensus on what are the important factors and how to measure them. Consequently, we cannot consider current research conclusive and in this paper, we outline different directions for necessary future investigations.
SEApr 26, 2019
Are Architectural Smells Independent from Code Smells? An Empirical StudyFrancesca Arcelli Fontanaa, Valentina Lenarduzzi, Riccardo Roveda et al.
Background. Architectural smells and code smells are symptoms of bad code or design that can cause different quality problems, such as faults, technical debt, or difficulties with maintenance and evolution. Some studies show that code smells and architectural smells often appear together in the same file. The correlation between code smells and architectural smells, however, is not clear yet; some studies on a limited set of projects have claimed that architectural smells can be derived from code smells, while other studies claim the opposite. Objective. The goal of this work is to understand whether architectural smells are independent from code smells or can be derived from a code smell or from one category of them. Method. We conducted a case study analyzing the correlations among 19 code smells, six categories of code smells, and four architectural smells. Results. The results show that architectural smells are correlated with code smells only in a very low number of occurrences and therefore cannot be derived from code smells. Conclusion. Architectural smells are independent from code smells, and therefore deserve special attention by researchers, who should investigate their actual harmfulness, and practitioners, who should consider whether and when to remove them.
SEFeb 17, 2019
Does Migrate a Monolithic System to Microservices Decrease the Technical Debt?Valentina Lenarduzzi, Francesco Lomio, Nyyti Saarimäki et al.
Background. The migration from monolithic systems to microservices involves deep refactoring of the systems. Therefore, the migration usually has a big economic impact and companies tend to postpone several activities during this process, mainly to speed-up the migration itself, but also because of the need to release new features. Objective. We monitored the Technical Debt of a small and medium enterprise while migrating a legacy monolithic system to an ecosystem of microservices to analyze changes in the code technical debt before and after the migration to microservices. Method. We conducted a case study analyzing more than four years of the history of a big project (280K Lines of Code) where two teams extracted five business processes from the monolithic system as microservices, by first analyzing the Technical Debt with SonarQube and then performing a qualitative study with the developers to understand the perceived quality of the system and the motivation for eventually postponed activities. Result. The development of microservices helps to reduce the Technical Debt in the long run. Despite an initial spike in the Technical Debt, due to the development of the new microservice, after a relatively short period, the Technical Debt tends to grow slower than in the monolithic system.
SEOct 25, 2018
Microservices, Continuous Architecture, and Technical Debt Interest: An Empirical StudyValentina Lenarduzzi, Davide Taibi
Continuous Architecture (CA) is an approach that supports companies in decreasing the time between deliveries. Migration to microservices is one of the most common situations when companies adopt continuous architecting processes [4]. Companies commonly adopt an initial migration strategy to extract some components from the monolithic system as microservices, making use of simplified microservices patterns [5][4]. As an example, companies commonly directly connect the microservices to the legacy monolithic system and do not adopt any message bus at the beginning. When the system starts to grow in complexity, they usually start re-architecting their systems, considering different architectural patterns. Some companies introduce API gateway patterns to simplify the management and load balancing of the different services, while others also consider a lightweight message bus [4][5][6]. All these architectural changes require deep refactoring of the system, thereby increasing the risk of new faults being introduced. In this paper, we report the preliminary results of work in progress, where we monitored the TD of an SME (SMEs = small and medium enterprises) that adopted a CA approach while migrating a legacy monolithic system to an ecosystem of microservices. To the best of our knowledge, no studies exist on the impact of postponed activities on the TD, especially in the context of CA and microservices. This work will help companies understand how TD grows and changes over time while at the same time opening up new avenues for future research on the analysis of TD interest in continuous architecting processes.