Antonio Martini

SE
h-index7
13papers
117citations
Novelty26%
AI Score25

13 Papers

AIAug 17, 2024
Maintainability Challenges in ML: A Systematic Literature Review

Karthik Shivashankar, Antonio Martini

Background: As Machine Learning (ML) advances rapidly in many fields, it is being adopted by academics and businesses alike. However, ML has a number of different challenges in terms of maintenance not found in traditional software projects. Identifying what causes these maintainability challenges can help mitigate them early and continue delivering value in the long run without degrading ML performance. Aim: This study aims to identify and synthesise the maintainability challenges in different stages of the ML workflow and understand how these stages are interdependent and impact each other's maintainability. Method: Using a systematic literature review, we screened more than 13000 papers, then selected and qualitatively analysed 56 of them. Results: (i) a catalogue of maintainability challenges in different stages of Data Engineering, Model Engineering workflows and the current challenges when building ML systems are discussed; (ii) a map of 13 maintainability challenges to different interdependent stages of ML that impact the overall workflow; (iii) Provided insights to developers of ML tools and researchers. Conclusions: In this study, practitioners and organisations will learn about maintainability challenges and their impact at different stages of ML workflow. This will enable them to avoid pitfalls and help to build a maintainable ML system. The implications and challenges will also serve as a basis for future research to strengthen our understanding of the ML system's maintainability.

SEAug 17, 2024
Better Python Programming for all: With the focus on Maintainability

Karthik Shivashankar, Antonio Martini

This study aims to enhance the maintainability of code generated by Large Language Models (LLMs), with a focus on the Python programming language. As the use of LLMs for coding assistance grows, so do concerns about the maintainability of the code they produce. Previous research has mainly concentrated on the functional accuracy and testing success of generated code, overlooking aspects of maintainability. Our approach involves the use of a specifically designed dataset for training and evaluating the model, ensuring a thorough assessment of code maintainability. At the heart of our work is the fine-tuning of an LLM for code refactoring, aimed at enhancing code readability, reducing complexity, and improving overall maintainability. After fine-tuning an LLM to prioritize code maintainability, our evaluations indicate that this model significantly improves code maintainability standards, suggesting a promising direction for the future of AI-assisted software development.

SEAug 17, 2024
Identifying Technical Debt and Its Types Across Diverse Software Projects Issues

Karthik Shivashankar, Mili Orucevic, Maren Maritsdatter Kruke et al.

Technical Debt (TD) identification in software projects issues is crucial for maintaining code quality, reducing long-term maintenance costs, and improving overall project health. This study advances TD classification using transformer-based models, addressing the critical need for accurate and efficient TD identification in large-scale software development. Our methodology employs multiple binary classifiers for TD and its type, combined through ensemble learning, to enhance accuracy and robustness in detecting various forms of TD. We train and evaluate these models on a comprehensive dataset from GitHub Archive Issues (2015-2024), supplemented with industrial data validation. We demonstrate that in-project fine-tuned transformer models significantly outperform task-specific fine-tuned models in TD classification, highlighting the importance of project-specific context in accurate TD identification. Our research also reveals the superiority of specialized binary classifiers over multi-class models for TD and its type identification, enabling more targeted debt resolution strategies. A comparative analysis shows that the smaller DistilRoBERTa model is more effective than larger language models like GPTs for TD classification tasks, especially after fine-tuning, offering insights into efficient model selection for specific TD detection tasks. The study also assesses generalization capabilities using metrics such as MCC, AUC ROC, Recall, and F1 score, focusing on model effectiveness, fine-tuning impact, and relative performance. By validating our approach on out-of-distribution and real-world industrial datasets, we ensure practical applicability, addressing the diverse nature of software projects.

SEApr 15, 2025
TD-Suite: All Batteries Included Framework for Technical Debt Classification

Karthik Shivashankar, Antonio Martini

Recognizing that technical debt is a persistent and significant challenge requiring sophisticated management tools, TD-Suite offers a comprehensive software framework specifically engineered to automate the complex task of its classification within software projects. It leverages the advanced natural language understanding of state-of-the-art transformer models to analyze textual artifacts, such as developer discussions in issue reports, where subtle indicators of debt often lie hidden. TD-Suite provides a seamless end-to-end pipeline, managing everything from initial data ingestion and rigorous preprocessing to model training, thorough evaluation, and final inference. This allows it to support both straightforward binary classification (debt or no debt) and more valuable, identifying specific categories like code, design, or documentation debt, thus enabling more targeted management strategies. To ensure the generated models are robust and perform reliably on real-world, often imbalanced, datasets, TD-Suite incorporates critical training methodologies: k-fold cross-validation assesses generalization capability, early stopping mechanisms prevent overfitting to the training data, and class weighting strategies effectively address skewed data distributions. Beyond core functionality, and acknowledging the growing importance of sustainability, the framework integrates tracking and reporting of carbon emissions associated with the computationally intensive model training process. It also features a user-friendly Gradio web interface in a Docker container setup, simplifying model interaction, evaluation, and inference.

SEApr 15, 2025
Scalability and Maintainability Challenges and Solutions in Machine Learning: Systematic Literature Review

Karthik Shivashankar, Ghadi S. Al Hajj, Antonio Martini

This systematic literature review examines the critical challenges and solutions related to scalability and maintainability in Machine Learning (ML) systems. As ML applications become increasingly complex and widespread across industries, the need to balance system scalability with long-term maintainability has emerged as a significant concern. This review synthesizes current research and practices addressing these dual challenges across the entire ML life-cycle, from data engineering to model deployment in production. We analyzed 124 papers to identify and categorize 41 maintainability challenges and 13 scalability challenges, along with their corresponding solutions. Our findings reveal intricate inter dependencies between scalability and maintainability, where improvements in one often impact the other. The review is structured around six primary research questions, examining maintainability and scalability challenges in data engineering, model engineering, and ML system development. We explore how these challenges manifest differently across various stages of the ML life-cycle. This comprehensive overview offers valuable insights for both researchers and practitioners in the field of ML systems. It aims to guide future research directions, inform best practices, and contribute to the development of more robust, efficient, and sustainable ML applications across various domains.

SEJan 30, 2025
MLScent A tool for Anti-pattern detection in ML projects

Karthik Shivashankar, Antonio Martini

Machine learning (ML) codebases face unprecedented challenges in maintaining code quality and sustainability as their complexity grows exponentially. While traditional code smell detection tools exist, they fail to address ML-specific issues that can significantly impact model performance, reproducibility, and maintainability. This paper introduces MLScent, a novel static analysis tool that leverages sophisticated Abstract Syntax Tree (AST) analysis to detect anti-patterns and code smells specific to ML projects. MLScent implements 76 distinct detectors across major ML frameworks including TensorFlow (13 detectors), PyTorch (12 detectors), Scikit-learn (9 detectors), and Hugging Face (10 detectors), along with data science libraries like Pandas and NumPy (8 detectors each). The tool's architecture also integrates general ML smell detection (16 detectors), and specialized analysis for data preprocessing and model training workflows. Our evaluation demonstrates MLScent's effectiveness through both quantitative classification metrics and qualitative assessment via user studies feedback with ML practitioners. Results show high accuracy in identifying framework-specific anti-patterns, data handling issues, and general ML code smells across real-world projects.

SEApr 16, 2025
Unravelling Technical debt topics through Time, Programming Languages and Repository

Karthik Shivashankar, Antonio Martini

This study explores the dynamic landscape of Technical Debt (TD) topics in software engineering by examining its evolution across time, programming languages, and repositories. Despite the extensive research on identifying and quantifying TD, there remains a significant gap in understanding the diversity of TD topics and their temporal development. To address this, we have conducted an explorative analysis of TD data extracted from GitHub issues spanning from 2015 to September 2023. We employed BERTopic for sophisticated topic modelling. This study categorises the TD topics and tracks their progression over time. Furthermore, we have incorporated sentiment analysis for each identified topic, providing a deeper insight into the perceptions and attitudes associated with these topics. This offers a more nuanced understanding of the trends and shifts in TD topics through time, programming language, and repository.

SEApr 15, 2025
QualiTagger: Automating software quality detection in issue trackers

Karthik Shivashankar, Rafael Capilla, Maren Maritsdatter Kruke et al.

A systems quality is a major concern for development teams when it evolve. Understanding the effects of a loss of quality in the codebase is crucial to avoid side effects like the appearance of technical debt. Although the identification of these qualities in software requirements described in natural language has been investigated, most of the results are often not applicable in practice, and rely on having been validated on small datasets and limited amount of projects. For many years, machine learning (ML) techniques have been proved as a valid technique to identify and tag terms described in natural language. In order to advance previous works, in this research we use cutting edge models like Transformers, together with a vast dataset mined and curated from GitHub, to identify what text is usually associated with different quality properties. We also study the distribution of such qualities in issue trackers from openly accessible software repositories, and we evaluate our approach both with students from a software engineering course and with its application to recognize security labels in industry.

SEAug 1, 2021
Agile Elicitation of Scalability Requirements for Open Systems: A Case Study

Gunnar Brataas, Antonio Martini, Geir Kjetil Hanssen et al.

Eliciting scalability requirements during agile software development is complicated and poorly described in previous research. This article presents a lightweight artifact for eliciting scalability requirements during agile software development: the ScrumScale model. The ScrumScale model is a simple spreadsheet. The scalability concepts underlying the ScrumScale model are clarified in this design science research, which also utilizes coordination theory. This paper describes the open banking case study, where a legacy banking system becomes open. This challenges the scalability of this legacy system. The first step in understanding this challenge is to elicit the new scalability requirements. In the open banking case study, key stakeholders from TietoEVRY spent 55 hours eliciting TietoEVRY's open banking project's scalability requirements. According to TietoEVRY, the ScrumScale model provided a systematic way of producing scalability requirements. For TietoEVRY, the scalability concepts behind the ScrumScale model also offered significant advantages in dialogues with other stakeholders.

SEJan 5, 2021
The use of incentives to promote Technical Debt management

Terese Besker, Antonio Martini, Jan Bosch

When developing software, it is vitally important to keep the level of technical debt down since it is well established from several studies that technical debt can, e.g., lower the development productivity, decrease the developers' morale, and compromise the overall quality of the software. However, even if researchers and practitioners working in today's software development industry are quite familiar with the concept of technical debt and its related negative consequences, there has been no empirical research focusing specifically on how software managers actively communicate and manage the need to keep the level of technical debt as low as possible.

SESep 22, 2020
Measuring affective states from technical debt: A psychoempirical software engineering experiment

Jesper Olsson, Erik Risfelt, Terese Besker et al.

Software engineering is a human activity. Despite this, human aspects are under-represented in technical debt research, perhaps because they are challenging to evaluate. This study's objective was to investigate the relationship between technical debt and affective states (feelings, emotions, and moods) from software practitioners. Forty participants (N = 40) from twelve companies took part in a mixed-methods approach, consisting of a repeated-measures (r = 5) experiment (n = 200), a survey, and semi-structured interviews. The statistical analysis shows that different design smells (strong indicators of technical debt) negatively or positively impact affective states. From the qualitative data, it is clear that technical debt activates a substantial portion of the emotional spectrum and is psychologically taxing. Further, the practitioners' reactions to technical debt appear to fall in different levels of maturity. We argue that human aspects in technical debt are important factors to consider, as they may result in, e.g., procrastination, apprehension, and burnout.

SEAug 2, 2019
Towards Surgically-Precise Technical Debt Estimation: Early Results and Research Roadmap

Valentina Lenarduzzi, Antonio Martini, Davide Taibi et al.

The concept of technical debt has been explored from many perspectives but its precise estimation is still under heavy empirical and experimental inquiry. We aim to understand whether, by harnessing approximate, data-driven, machine-learning approaches it is possible to improve the current techniques for technical debt estimation, as represented by a top industry quality analysis tool such as SonarQube. For the sake of simplicity, we focus on relatively simple regression modelling techniques and apply them to modelling the additional project cost connected to the sub-optimal conditions existing in the projects under study. Our results shows that current techniques can be improved towards a more precise estimation of technical debt and the case study shows promising results towards the identification of more accurate estimation of technical debt.

SEApr 29, 2019
Technical Debt Prioritization: State of the Art. A Systematic Literature Review

Valentina Lenarduzzi, Terese Besker, Davide Taibi et al.

Background. Software companies need to manage and refactor Technical Debt issues. Therefore, it is necessary to understand if and when refactoring Technical Debt should be prioritized with respect to developing features or fixing bugs. Objective. The goal of this study is to investigate the existing body of knowledge in software engineering to understand what Technical Debt prioritization approaches have been proposed in research and industry. Method. We conducted a Systematic Literature Review among 384 unique papers published until 2018, following a consolidated methodology applied in Software Engineering. We included 38 primary studies. Results. Different approaches have been proposed for Technical Debt prioritization, all having different goals and optimizing on different criteria. The proposed measures capture only a small part of the plethora of factors used to prioritize Technical Debt qualitatively in practice. We report an impact map of such factors. However, there is a lack of empirical and validated set of tools. Conclusion. We observed that technical Debt prioritization research is preliminary and there is no consensus on what are the important factors and how to measure them. Consequently, we cannot consider current research conclusive and in this paper, we outline different directions for necessary future investigations.