Gemma Catolino

SE
h-index18
9papers
15citations
Novelty39%
AI Score45

9 Papers

56.2SEApr 7Code
Bias Ahead: Sensitive Prompts as Early Warnings for Fairness in Large Language Models

Gianmario Voria, Martina De Lucia, Alessandra Raia et al.

Large Language Models (LLMs) are being increasingly integrated into software systems, offering powerful capabilities but also raising concerns about fairness. Existing fairness benchmarks, however, focus on stereotype-specific associations, which limit their ability to anticipate risks in diverse, real-world contexts. In this paper, we propose sensitive prompts as a new abstraction for fairness evaluation: inputs that are not inherently biased but are more likely to elicit biased or inadequate responses due to the sensitivity of their content. We construct and release SensY, a dataset of 12,801 prompts, categorized as sensitive and non-sensitive, spanning seven thematic domains, combining synthetic generation and real user inputs. Using this dataset, we query three open-source LLMs and manually analyze 4,500 responses to evaluate their adequacy in answering sensitive prompts. Results show that while models often provide factually correct answers, they frequently fail to acknowledge the ethical, relational, or contextual implications of sensitive inputs. In addition, we develop an automated classifier for predicting prompt sensitivity, achieving robust performance on sensitive prompts. Our findings demonstrate that prompt sensitivity can serve as an effective early-warning mechanism for fairness risks in LLMs. This perspective shifts fairness assessment from reactive mitigation toward preventive design, enabling developers to anticipate and manage bias before deployment.

37.8SEApr 7
SCOPE: A Dataset of Stereotyped Prompts for Counterfactual Fairness Assessment of LLMs

Alessandra Parziale, Gianmario Voria, Valeria Pontillo et al.

Large Language Models (LLMs) now serve as the foundation for a wide range of applications, from conversational assistants to decision support tools, making the issue of fairness in their results increasingly important. Previous studies have shown that LLM outputs can shift when prompts reference different demographic groups, even when intent and semantic content remain constant. However, existing resources for probing such disparities rely primarily on small, template-based counterfactual examples or fixed sentence pairs. These benchmarks offer limited linguistic diversity, narrow topical coverage, and little support for analyzing how communicative intent affects model behavior. To address these limitations, we introduce SCOPE (Stereotype-COnditioned Prompts for Evaluation), a large-scale dataset of counterfactual prompt pairs designed to enable systematic investigation of group-sensitive behavior in LLMs. SCOPE contains 241,280 prompts organized into 120,640 counterfactual pairs, each grounded in one of 1,438 topics and spanning nine bias dimensions and 1,536 demographic groups. All prompts are generated under four distinct communicative intents: Question, Recommendation, Direction, and Clarification, ensuring broad coverage of common interaction styles. This resource provides a controlled, semantically aligned, and intent-aware basis for evaluating fairness, robustness, and counterfactual consistency.

SEAug 29, 2024
A Catalog of Fairness-Aware Practices in Machine Learning Engineering

Gianmario Voria, Giulia Sellitto, Carmine Ferrara et al.

Machine learning's widespread adoption in decision-making processes raises concerns about fairness, particularly regarding the treatment of sensitive features and potential discrimination against minorities. The software engineering community has responded by developing fairness-oriented metrics, empirical studies, and approaches. However, there remains a gap in understanding and categorizing practices for engineering fairness throughout the machine learning lifecycle. This paper presents a novel catalog of practices for addressing fairness in machine learning derived from a systematic mapping study. The study identifies and categorizes 28 practices from existing literature, mapping them onto different stages of the machine learning lifecycle. From this catalog, the authors extract actionable items and implications for both researchers and practitioners in software engineering. This work aims to provide a comprehensive resource for integrating fairness considerations into the development and deployment of machine learning systems, enhancing their reliability, accountability, and credibility.

SEJan 9
Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models

Gianmario Voria, Moses Openja, Foutse Khomh et al.

The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.

67.7SEApr 14
Exploring Individual Factors in the Adoption of LLMs for Specific Software Engineering Purposes

Stefano Lambiase, Gemma Catolino, Fabio Palomba et al.

Context: The advent of Large Language Models (LLMs) is transforming software development, significantly enhancing software engineering (SE) processes. Research has explored their role within development teams, focusing on the specific purposes for which LLMs are used within SE tasks, such as artifact generation, decision-making support, and information retrieval. Despite the growing body of work on LLMs in SE, most studies have centered on broad adoption trends, neglecting the nuanced relationship between individual cognitive and behavioral factors and their impact on purpose-specific adoption. While factors such as perceived effort and performance expectancy have been explored at a general level, their influence on distinct SE purposes remains underexamined. This gap hinders the development of tailored LLM-based systems (e.g., Generative AI Agents) that align with engineers' specific needs and limits the ability of team leaders to devise effective strategies for fostering LLM adoption in targeted workflows. Objectives: For the reasons mentioned above, this study aims to study the individual factors that drive the choice to use LLMs for distinct SE purposes. Methods: To achieve the above-mentioned objective, we surveyed 188 software engineers to test the relationship between individual attributes related to technology adoption and LLM adoption across five key purposes, using structural equation modeling (SEM). The Unified Theory of Acceptance and Use of Technology (UTAUT2) was applied to characterize individual adoption behaviors. Results: The findings reveal that purpose-specific adoption is influenced by distinct factors, some of which negatively impact adoption when considered in isolation, underscoring the complexity of LLM integration in SE.

SEDec 18, 2024
From Expectation to Habit: Why Do Software Practitioners Adopt Fairness Toolkits?

Gianmario Voria, Stefano Lambiase, Maria Concetta Schiavone et al.

As the adoption of machine learning (ML) systems continues to grow across industries, concerns about fairness and bias in these systems have taken center stage. Fairness toolkits, designed to mitigate bias in ML models, serve as critical tools for addressing these ethical concerns. However, their adoption in the context of software development remains underexplored, especially regarding the cognitive and behavioral factors driving their usage. As a deeper understanding of these factors could be pivotal in refining tool designs and promoting broader adoption, this study investigates the factors influencing the adoption of fairness toolkits from an individual perspective. Guided by the Unified Theory of Acceptance and Use of Technology (UTAUT2), we examined the factors shaping the intention to adopt and actual use of fairness toolkits. Specifically, we employed Partial Least Squares Structural Equation Modeling (PLS-SEM) to analyze data from a survey study involving practitioners in the software industry. Our findings reveal that performance expectancy and habit are the primary drivers of fairness toolkit adoption. These insights suggest that by emphasizing the effectiveness of these tools in mitigating bias and fostering habitual use, organizations can encourage wider adoption. Practical recommendations include improving toolkit usability, integrating bias mitigation processes into routine development workflows, and providing ongoing support to ensure professionals see clear benefits from regular use.

SEDec 20, 2024
Data Preparation for Fairness-Performance Trade-Offs: A Practitioner-Friendly Alternative?

Gianmario Voria, Rebecca Di Matteo, Giammaria Giordano et al.

As machine learning (ML) systems are increasingly adopted across industries, addressing fairness and bias has become essential. While many solutions focus on ethical challenges in ML, recent studies highlight that data itself is a major source of bias. Pre-processing techniques, which mitigate bias before training, are effective but may impact model performance and pose integration difficulties. In contrast, fairness-aware Data Preparation practices are both familiar to practitioners and easier to implement, providing a more accessible approach to reducing bias. Objective. This registered report proposes an empirical evaluation of how optimally selected fairness-aware practices, applied in early ML lifecycle stages, can enhance both fairness and performance, potentially outperforming standard pre-processing bias mitigation methods. Method. To this end, we will introduce FATE, an optimization technique for selecting 'Data Preparation' pipelines that optimize fairness and performance. Using FATE, we will analyze the fairness-performance trade-off, comparing pipelines selected by FATE with results by pre-processing bias mitigation techniques.

SEJul 25, 2019
Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs

Gemma Catolino, Fabio Palomba, Andy Zaidman et al.

Modern version control systems such as Git or SVN include bug tracking mechanisms, through which developers can highlight the presence of bugs through bug reports, i.e., textual descriptions reporting the problem and what are the steps that led to a failure. In past and recent years, the research community deeply investigated methods for easing bug triage, that is, the process of assigning the fixing of a reported bug to the most qualified developer. Nevertheless, only a few studies have reported on how to support developers in the process of understanding the type of a reported bug, which is the first and most time-consuming step to perform before assigning a bug-fix operation. In this paper, we target this problem in two ways: first, we analyze 1,280 bug reports of 119 popular projects belonging to three ecosystems such as Mozilla, Apache, and Eclipse, with the aim of building a taxonomy of the root causes of reported bugs; then, we devise and evaluate an automated classification model able to classify reported bugs according to the defined taxonomy. As a result, we found nine main common root causes of bugs over the considered systems. Moreover, our model achieves high F-Measure and AUC-ROC (64% and 74% on overall, respectively).

SEMay 26, 2019
Improving Change Prediction Models with Code Smell-Related Information

Gemma Catolino, Fabio Palomba, Francesca Arcelli Fontana et al.

Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis of these findings, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, ie models having as goal that of indicating to developers which classes are more likely to change in the future, so that they may apply preventive maintenance actions. Specifically, we exploit the so-called intensity index - a previously defined metric that captures the severity of a code smell - and evaluate its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with the one of an alternative technique that considers the previously defined antipattern metrics, namely a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than that of the baselines and (ii) the intensity is a more powerful metric with respect to the alternative smell-related ones.