SE | Apr 24
Enhancing a gamified tool for UML modeling education
Giacomo Garaccione, Riccardo Coppola, Luca Ardito
Unified Modeling Language (UML) Use Case and Class Diagrams are fundamental modeling notations in Software Engineering (SE) education due to their importance for requirements and model-based engineering, yet their relevance is underestimated by students, who tend to dismiss the topic as secondary. Gamification has been adopted to make modeling education more appealing, but existing tools focus almost exclusively on class diagrams, leaving support for use cases and other notations unexplored. In 2025, we designed UMLegend, a gamified tool for class diagrams that offered dynamic feedback to help students learn correct modeling practices and multiple long-term mechanics to increase engagement, and performed a study with the tool. In this paper, we describe how we enhanced UMLegend following the results of the experiment so that it can support more modeling languages, with use case diagrams added to the types of exercises available in the tool. The revised version has been refactored into a modular architecture, to make it easier to add other software engineering topics and additional modeling notations. We also describe the potential impact we expect the new version to have, and outline a longitudinal study we intend to perform in 2026 to assess whether long-term UML gamification leads to improved student performance.
SE | Apr 24
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
Anna Arnaudo, Riccardo Coppola, Maurizio Morisio et al.
Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high-level goal extraction, and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification, the final stage, these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we found that combining the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.
AI | Mar 19
Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures
Martina Ullasci, Marco Rondina, Riccardo Coppola et al.
Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.
SE | Aug 18, 2019 | Code
Characterizing the transition to Kotlin of Android apps: a study on F-Droid, Play Store and GitHub
Riccardo Coppola, Luca Ardito, Marco Torchiano
Kotlin is a novel language that represents an alternative to Java, and has been recently adopted as a first-class programming language for Android applications. Kotlin is achieving a significant diffusion among developers, and several studies have highlighted various advantages of the language when compared to Java. The objective of this paper is to analyze a set of open-source Android apps, to evaluate their transition to the Kotlin programming language throughout their lifespan and understand whether the adoption of Kotlin has impacts on the success of Android apps. We mined all the projects from the F-Droid repository of Android open-source applications, and we found the corresponding projects on the official Google Play Store and on the GitHub platform. We defined a set of eight metrics to quantify the relevance of Kotlin code in the latest update and through all releases of an application. Then, we statistically analyzed the correlation between the presence of Kotlin code in a project and popularity metrics mined from the platforms where the apps were released. Of a set of 1232 projects that were updated after October 2017, nearly 20% adopted Kotlin and about 12% had more Kotlin code than Java; most of the projects that adopted Kotlin quickly transitioned from Java to the new language. The projects featuring Kotlin had on average higher popularity metrics; a statistically significant correlation has been found between the presence of Kotlin and the number of stars on the GitHub repository. The Kotlin language seems able to guarantee a seamless migration from Java for Android developers. By inspecting a large set of open-source Android apps, we observed that the adoption of the Kotlin language is rapid (when compared to the average lifespan of an Android project) and seems to come at no cost in terms of popularity among the users and other developers.
SE | Nov 9, 2017 | Code
Scripted GUI Testing of Android Apps: A Study on Diffusion, Evolution and Fragility
Riccardo Coppola, Maurizio Morisio, Marco Torchiano
Background. Evidence suggests that mobile applications are not as thoroughly tested as their desktop counterparts. In particular, GUI testing is generally limited. Like web-based applications, mobile apps suffer from GUI test fragility, i.e. GUI test classes failing due to minor modifications in the GUI, without the application functionalities being altered. Aims. The objective of our study is to examine the diffusion of GUI testing on Android, and the amount of changes required to keep test classes up to date, and in particular the changes due to GUI test fragility. We define metrics to characterize the modifications and evolution of test classes and test methods, and proxies to estimate fragility-induced changes. Method. To perform our experiments, we selected six widely used open-source tools for scripted GUI testing of mobile applications previously described in the literature. We have mined the repositories on GitHub that used those tools, and computed our set of metrics. Results. We found that none of the considered GUI testing frameworks achieved a major diffusion among the open-source Android projects available on GitHub. For projects with GUI tests, we found that test suites have to be modified often, specifically 5%-10% of developers' modified LOCs belong to tests, and that a relevant portion (60% on average) of such modifications are induced by fragility. Conclusions. Fragility of GUI test classes constitutes a relevant concern, possibly being an obstacle for developers to adopt automated scripted GUI tests. This first evaluation and measure of fragility of Android scripted GUI testing can constitute a benchmark for developers, and the basis for the definition of a taxonomy of fragility causes, and actionable guidelines to mitigate the issue.
AI | Mar 12
Gender Bias in Generative AI-assisted Recruitment Processes
Martina Ullasci, Marco Rondina, Riccardo Coppola et al.
In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. However, the employment of large language models (LLMs) risks reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. The objective of this paper is to evaluate and measure this phenomenon, analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background, focusing on under-35-year-old Italian graduates. The model has been prompted to suggest jobs to 24 simulated candidate profiles, which are balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles and industry, gendered linguistic patterns emerged in the adjectives attributed to female and male candidates, indicating a tendency of the model to associate women with emotional and empathetic traits, and men with strategic and analytical ones. The research raises an ethical question regarding the use of these models in sensitive processes, highlighting the need for transparency and fairness in future digital labour markets.
CL | Jul 25, 2025
An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case
Gioele Giachino, Marco Rondina, Antonio Vetrò et al.
The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs' ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a total of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of 'she' pronouns to the 'assistant' rather than the 'manager'. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplace or in job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods or further exploiting a larger experimental base.
SE | Feb 10, 2025
Testing software for non-discrimination: an updated and extended audit in the Italian car insurance domain
Marco Rondina, Antonio Vetrò, Riccardo Coppola et al.
Context. As software systems become more integrated into society's infrastructure, the responsibility of software professionals to ensure compliance with various non-functional requirements increases. These requirements include security, safety, privacy, and, increasingly, non-discrimination. Motivation. Fairness in pricing algorithms grants equitable access to basic services without discriminating on the basis of protected attributes. Method. We replicate a previous empirical study that used black box testing to audit pricing algorithms used by Italian car insurance companies, accessible through a popular online system. With respect to the previous study, we enlarged the number of tests and the number of demographic variables under analysis. Results. Our work confirms and extends previous findings, highlighting the problematic permanence of discrimination across time: demographic variables significantly impact pricing to this day, with birthplace remaining the main discriminatory factor against individuals not born in Italian cities. We also found that driver profiles can determine the number of quotes available to the user, denying equal opportunities to all. Conclusion. The study underscores the importance of testing for non-discrimination in software systems that affect people's everyday lives. Performing algorithmic audits over time makes it possible to evaluate the evolution of such algorithms. It also demonstrates the role that empirical software engineering can play in making software systems more accountable.
HC | Jun 25, 2020
Mood-based On-Car Music Recommendations
Erion Çano, Riccardo Coppola, Eleonora Gargiulo et al.
Driving and music listening are two inseparable everyday activities for millions of people around the world. Considering the high correlation between music, mood and driving comfort and safety, it makes sense to use appropriate and intelligent music recommendations based on the mood of drivers and songs in the context of car driving. The objective of this paper is to present the project of a contextual mood-based music recommender system capable of regulating the driver's mood and trying to have a positive influence on her driving behaviour. Here we present the proof of concept of the system and describe the techniques and technologies that are part of it. Possible future improvements on each of the building blocks are also presented.
SE | Jul 18, 2019
Fragility of Layout-Based and Visual GUI Test Scripts: An Assessment Study on a Hybrid Mobile Application
Riccardo Coppola, Luca Ardito, Marco Torchiano
Context: Although different approaches exist for automated GUI testing of hybrid mobile applications, the practice does not appear to be commonly adopted by developers. A possible reason for such a low diffusion can be the fragility of the techniques, i.e. the frequent need for maintaining test cases when the GUI of the app is changed. Goal: In this paper, we perform an assessment of the maintenance needed by test cases for a hybrid mobile app, and the related fragility causes. Methods: We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. Results: We found that 20% of Layout-based test methods and 30% of Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Conclusion: Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.