SEJun 2
Analyzing the Evolution of Structural Communities within Microservice ArchitectureAlexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi et al.
In recent years, the detection of anti-patterns in microservice architecture has gained traction, particularly to identify instances of Microservice Architectural Degradation. In such tasks, the microservice architecture is often modeled as a network of microservice dependencies. Recent works have explored how to assess the evolution of such architectural networks by considering the architecture of consecutive releases of the project. Particular anti-patterns related to the structure of the service network include Wrong cuts and Knot services. Community detection is a way to identify groups of services in a network that strongly depend on each other. If such groups cannot be mapped to business processes in the system, or if the same service belongs to multiple communities, this could indicate architectural degradation due to an inappropriate division of responsibilities or unoptimized communication. Temporal community detection methods have been proposed to analyze community structure that evolves in time. We performed temporal community detection within the microservice architecture of six releases of the train-ticket benchmark and analyzed the composition of the discovered communities and their activities over time. We observed a stable architecture with a clear separation of services into two communities, which we could identify with two business processes performed by the system. We found services belonging to several communities, as well as services within the same community with both incoming and outgoing connections. The membership strength metric provided by the leveraged algorithm enables fine-grained assessment of the microservice communities.
SEJan 5Code
The Invisible Hand of AI Libraries Shaping Open Source Projects and CommunitiesMatteo Esposito, Andrea Janes, Valentina Lenarduzzi et al.
In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software. What began as a revolutionary idea has now become the cornerstone of computer science. Amidst OSS projects, AI is increasing its presence and relevance. However, despite the growing popularity of AI, its adoption and impacts on OSS projects remain underexplored. We aim to assess the adoption of AI libraries in Python and Java OSS projects and examine how they shape development, including the technical ecosystem and community engagement. To this end, we will perform a large-scale analysis on 157.7k potential OSS repositories, employing repository metrics and software metrics to compare projects adopting AI libraries against those that do not. We expect to identify measurable differences in development activity, community engagement, and code complexity between OSS projects that adopt AI libraries and those that do not, offering evidence-based insights into how AI integration reshapes software development practices.
SENov 14, 2025Code
SQuaD: The Software Quality DatasetMikel Robredo, Matteo Esposito, Davide Taibi et al.
Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. The SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on ZENODO (DOI: 10.5281/zenodo.17566690).
SEJul 8, 2024
6GSoft: Software for Edge-to-Cloud ContinuumMuhammad Azeem Akbar, Matteo Esposito, Sami Hyrynsalmi et al.
In the era of 6G, developing and managing software requires cutting-edge software engineering (SE) theories and practices tailored for such complexity across a vast number of connected edge devices. Our project aims to lead the development of sustainable methods and energy-efficient orchestration models specifically for edge environments, enhancing architectural support driven by AI for contemporary edge-to-cloud continuum computing. This initiative seeks to position Finland at the forefront of the 6G landscape, focusing on sophisticated edge orchestration and robust software architectures to optimize the performance and scalability of edge networks. Collaborating with leading Finnish universities and companies, the project emphasizes deep industry-academia collaboration and international expertise to address critical challenges in edge orchestration and software architecture, aiming to drive significant advancements in software productivity and market impact.
SEJan 5
A Defect is Being Born: How Close Are We? A Time Sensitive Forecasting ApproachMikel Robredo, Matteo Esposito, Fabio Palomba et al.
Background. Defect prediction has been a highly active topic among researchers in the Empirical Software Engineering field. Previous literature has successfully achieved the most accurate prediction of an incoming fault and identified the features and anomalies that precede it through just-in-time prediction. As software systems evolve continuously, there is a growing need for time-sensitive methods capable of forecasting defects before they manifest. Aim. Our study seeks to explore the effectiveness of time-sensitive techniques for defect forecasting. Moreover, we aim to investigate the early indicators that precede the occurrence of a defect. Method. We will train multiple time-sensitive forecasting techniques to forecast the future bug density of a software project, as well as identify the early symptoms preceding the occurrence of a defect. Expected results. Our expected results are translated into empirical evidence on the effectiveness of our approach for early estimation of bug proneness.
SEFeb 5
Sovereign-by-Design A Reference Architecture for AI and Blockchain Enabled SystemsMatteo Esposito, Lodovica Marchesi, Roberto Tonelli et al.
Digital sovereignty has emerged as a central concern for modern software-intensive systems, driven by the dominance of non-sovereign cloud infrastructures, the rapid adoption of Generative AI, and increasingly stringent regulatory requirements. While existing initiatives address governance, compliance, and security in isolation, they provide limited guidance on how sovereignty can be operationalized at the architectural level. In this paper, we argue that sovereignty must be treated as a first-class architectural property rather than a purely regulatory objective. We introduce a Sovereign Reference Architecture that integrates self-sovereign identity, blockchain-based trust and auditability, sovereign data governance, and Generative AI deployed under explicit architectural control. The architecture explicitly captures the dual role of Generative AI as both a source of governance risk and an enabler of compliance, accountability, and continuous assurance when properly constrained. By framing sovereignty as an architectural quality attribute, our work bridges regulatory intent and concrete system design, offering a coherent foundation for building auditable, evolvable, and jurisdiction-aware AI-enabled systems. The proposed reference architecture provides a principled starting point for future research and practice at the intersection of software architecture, Generative AI, and digital sovereignty.
SEMay 12
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE SeminarDavide Taibi, Henry Muccini, Karthik Vaidhyanathan et al.
The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.
SENov 3, 2025Code
Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical DebtMurali Sridharan, Mikel Robredo, Leevi Rantala et al.
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.
SEMar 23, 2024
Leveraging Large Language Models for Preliminary Security Risk Analysis: A Mission-Critical Case StudyMatteo Esposito, Francesco Palagiano
Preliminary security risk analysis (PSRA) provides a quick approach to identify, evaluate and propose remeditation to potential risks in specific scenarios. The extensive expertise required for an effective PSRA and the substantial ammount of textual-related tasks hinder quick assessments in mission-critical contexts, where timely and prompt actions are essential. The speed and accuracy of human experts in PSRA significantly impact response time. A large language model can quickly summarise information in less time than a human. To our knowledge, no prior study has explored the capabilities of fine-tuned models (FTM) in PSRA. Our case study investigates the proficiency of FTM to assist practitioners in PSRA. We manually curated 141 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years.We compared the proficiency of the FTM versus seven human experts. Within the industrial context, our approach has proven successful in reducing errors in PSRA, hastening security risk detection, and minimizing false positives and negatives. This translates to cost savings for the company by averting unnecessary expenses associated with implementing unwarranted countermeasures. Therefore, experts can focus on more comprehensive risk analysis, leveraging LLMs for an effective preliminary assessment within a condensed timeframe.
CRDec 16, 2024
On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?Matteo Esposito, Francesco Palagiano, Valentina Lenarduzzi et al.
Context. The security of critical infrastructure has been a pressing concern since the advent of computers and has become even more critical in today's era of cyber warfare. Protecting mission-critical systems (MCSs), essential for national security, requires swift and robust governance, yet recent events reveal the increasing difficulty of meeting these challenges. Aim. Building on prior research showcasing the potential of Generative AI (GAI), such as Large Language Models, in enhancing risk analysis, we aim to explore practitioners' views on integrating GAI into the governance of IT MCSs. Our goal is to provide actionable insights and recommendations for stakeholders, including researchers, practitioners, and policymakers. Method. We designed a survey to collect practical experiences, concerns, and expectations of practitioners who develop and implement security solutions in the context of MCSs. Conclusions and Future Works. Our findings highlight that the safe use of LLMs in MCS governance requires interdisciplinary collaboration. Researchers should focus on designing regulation-oriented models and focus on accountability; practitioners emphasize data protection and transparency, while policymakers must establish a unified AI framework with global benchmarks to ensure ethical and secure LLMs-based MCS governance.
SEMar 17, 2025
Generative AI for Software Architecture. Applications, Challenges, and Future DirectionsMatteo Esposito, Xiaozhou Li, Sergio Moreschini et al.
Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieved-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.
SEJan 22, 2025
A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software EngineeringMatteo Esposito, Mikel Robredo, Murali Sridharan et al.
Context: Empirical Software Engineering (ESE) drives innovation in SE through qualitative and quantitative studies. However, concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on SE. Objective: To analyze three decades of SE research, identify mistakes in statistical methods, and evaluate experts' ability to detect and address these issues. Methods: We conducted a literature survey of ~27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate. Additionally, we selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues. Results: Significant statistical issues were found in the primary studies, and experts showed limited ability to detect and correct these methodological problems, raising concerns about the broader ESE community's proficiency in this area. Conclusions. Despite our study's eventual limitations, its results shed light on recurring issues from promoting information copy-and-paste from past authors' works and the continuous publication of inadequate approaches that promote dubious results and jeopardize the spread of the correct statistical strategies among researchers. Besides, it justifies further investigation into empirical rigor in software engineering to expose these recurring issues and establish a framework for reassessing our field's foundation of statistical methodology application. Therefore, this work calls for critically rethinking and reforming data analysis in empirical software engineering, paving the way for our work soon.
SEJun 27, 2025
Autonomic Microservice Management via Agentic AI and MAPE-K IntegrationMatteo Esposito, Alexander Bakhtin, Noman Ahmad et al.
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
SEMar 31
Making Sense of AI Agents Hype: Adoption, Architectures, and Takeaways from PractitionersRuoyu Su, Matteo Esposito, Roberta Capuano et al.
To support practitioners in understanding how agentic systems are designed in real-world industrial practice, we present a review of practitioner conference talks on AI agents. We analyzed 138 recorded talks to examine how companies adopt agent-based architectures (Objective 1), identify recurring architectural strategies and patterns (Objective 2), and analyze application domains and technologies used to implement and operate LLM-driven agentic systems (Objective 3).
SESep 9, 2025
What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source ProjectsMikel Robredo, Matteo Esposito, Fabio Palomba et al.
Context. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Aim. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. Method. We performed a large-scale empirical study to analyze developers refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. Results. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. Conclusions. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.
SEMay 2, 2025
Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)Phuoc Pham, Murali Sridharan, Matteo Esposito et al.
In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances ``openly admitted'' by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
SEJan 9, 2025
Bringing Order Amidst Chaos: On the Role of Artificial Intelligence in Secure Software EngineeringMatteo Esposito
Context. Developing secure and reliable software remains a key challenge in software engineering (SE). The ever-evolving technological landscape offers both opportunities and threats, creating a dynamic space where chaos and order compete. Secure software engineering (SSE) must continuously address vulnerabilities that endanger software systems and carry broader socio-economic risks, such as compromising critical national infrastructure and causing significant financial losses. Researchers and practitioners have explored methodologies like Static Application Security Testing Tools (SASTTs) and artificial intelligence (AI) approaches, including machine learning (ML) and large language models (LLMs), to detect and mitigate these vulnerabilities. Each method has unique strengths and limitations. Aim. This thesis seeks to bring order to the chaos in SSE by addressing domain-specific differences that impact AI accuracy. Methodology. The research employs a mix of empirical strategies, such as evaluating effort-aware metrics, analyzing SASTTs, conducting method-level analysis, and leveraging evidence-based techniques like systematic dataset reviews. These approaches help characterize vulnerability prediction datasets. Results. Key findings include limitations in static analysis tools for identifying vulnerabilities, gaps in SASTT coverage of vulnerability types, weak relationships among vulnerability severity scores, improved defect prediction accuracy using just-in-time modeling, and threats posed by untouched methods. Conclusions. This thesis highlights the complexity of SSE and the importance of contextual knowledge in improving AI-driven vulnerability and defect prediction. The comprehensive analysis advances effective prediction models, benefiting both researchers and practitioners.
CLJun 11, 2024
Beyond Words: On Large Language Models Actionability in Mission-Critical Risk AnalysisMatteo Esposito, Francesco Palagiano, Valentina Lenarduzzi et al.
Context. Risk analysis assesses potential risks in specific scenarios. Risk analysis principles are context-less; the same methodology can be applied to a risk connected to health and information technology security. Risk analysis requires a vast knowledge of national and international regulations and standards and is time and effort-intensive. A large language model can quickly summarize information in less time than a human and can be fine-tuned to specific tasks. Aim. Our empirical study aims to investigate the effectiveness of Retrieval-Augmented Generation and fine-tuned LLM in risk analysis. To our knowledge, no prior study has explored its capabilities in risk analysis. Method. We manually curated 193 unique scenarios leading to 1283 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years. We compared the base GPT-3.5 and GPT-4 models versus their Retrieval-Augmented Generation and fine-tuned counterparts. We employ two human experts as competitors of the models and three other human experts to review the models and the former human experts' analysis. The reviewers analyzed 5,000 scenario analyses. Results and Conclusions. Human experts demonstrated higher accuracy, but LLMs are quicker and more actionable. Moreover, our findings show that RAG-assisted LLMs have the lowest hallucination rates, effectively uncovering hidden risks and complementing human expertise. Thus, the choice of model depends on specific needs, with FTMs for accuracy, RAG for hidden risks discovery, and base models for comprehensiveness and actionability. Therefore, experts can leverage LLMs as an effective complementing companion in risk analysis within a condensed timeframe. They can also save costs by averting unnecessary expenses associated with implementing unwarranted countermeasures.
SEJun 10, 2024
$Classi|Q\rangle$ Towards a Translation Framework To Bridge The Classical-Quantum Programming GapMatteo Esposito, Maryam Tavassoli Sabzevari, Boshuai Ye et al.
Quantum computing, albeit readily available as hardware or emulated on the cloud, is still far from being available in general regarding complex programming paradigms and learning curves. This vision paper introduces $Classi|Q\rangle$, a translation framework idea to bridge Classical and Quantum Computing by translating high-level programming languages, e.g., Python or C++, into a low-level language, e.g., Quantum Assembly. Our idea paper serves as a blueprint for ongoing efforts in quantum software engineering, offering a roadmap for further $Classi|Q\rangle$ development to meet the diverse needs of researchers and practitioners. $Classi|Q\rangle$ is designed to empower researchers and practitioners with no prior quantum experience to harness the potential of hybrid quantum computation. We also discuss future enhancements to $Classi|Q\rangle$, including support for additional quantum languages, improved optimization strategies, and integration with emerging quantum computing platforms.