SE Jun 3
How Software Engineering Students Use LLMs to Write Research Papers: An Experience Report
Ronnie de Souza Santos, Maria Teresa Baldassarre, Cleyton Magalhaes et al.
Large language models are increasingly becoming part of software engineering education, including activities involving empirical software engineering and evidence synthesis. This paper reports an educational experience involving the integration of reflective LLM use into an empirical methods assignment in a third-year software architecture course. Students were asked to develop a short research paper using either a rapid review or a gray literature review methodology and to disclose how LLMs were used throughout the assignment. We analyzed 146 student disclosure statements using a cross-analysis process combining LLM-assisted categorization with manual verification and refinement by the researchers. The reflections describe how students incorporated LLMs during activities such as brainstorming, methodological clarification, organization of findings, and writing refinement, while also reporting concerns regarding inaccuracies and verification of generated content. This experience report discusses lessons learned and educational implications for integrating AI-assisted technologies into empirical software engineering education.
CY Jun 1
Fairness Definitions and Metrics in Deep Reinforcement Learning for Drug Discovery in Healthcare: A Rapid Evidence Review
Esmaeil Shakeri, Ronnie de Souza Santos, Behrouz Far
Deep reinforcement learning (DRL) is increasingly applied to de novo molecular design, but choices in data, rewards, and evaluation can yield uneven performance across disease areas and chemotypes. Despite this, there is no concise synthesis of how fairness is defined, measured, and tested in DRL-based drug discovery. In this rapid evidence review, we synthesize fairness definitions and metrics for DRL-driven molecule generation in healthcare. We focus on three questions: (i) how dataset composition and split strategies, especially scaffold versus random splits, affect evaluation and distribution shift; (ii) how reward design (e.g., QED, docking, toxicity, synthetic accessibility) can create or mitigate bias, with emphasis on cancer targets; and (iii) which measurable metrics best capture fairness. This includes parity across cancer versus non-cancer indications and across cancer subtypes. It also includes distributional balance in key physicochemical descriptors, scaffold/chemotype diversity, groupwise validity, toxicity, and synthetic accessibility. From 2017 onward, we searched major biomedical, computer science, and engineering literature databases and used arXiv for horizon scanning. Records were screened using PRISMA-style procedures and analyzed via content coding to link reported parity outcomes to dataset and reward choices. Our review provides a concise set of fairness definitions and metrics for DRL molecule generation. It offers practical guidance for reporting distribution parity and outcome parity. It also summarizes how dataset and reward choices relate to observed parity effects and identifies open gaps relevant to trustworthy, cancer-relevant DRL generation.
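The contrast between scaffold and random splits mentioned in question (i) can be illustrated with a minimal sketch. It assumes each molecule already carries a precomputed scaffold identifier; the function names, the `(molecule_id, scaffold_id)` pair format, and the greedy largest-group-first assignment are illustrative assumptions, not procedures taken from the review.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Molecule = Tuple[str, str]  # (molecule_id, scaffold_id); both are hypothetical labels


def random_split(items: Sequence[Molecule], test_fraction: float,
                 seed: int = 0) -> Tuple[List[Molecule], List[Molecule]]:
    """Shuffle molecules and cut; the same scaffold may land on both sides."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def scaffold_split(items: Sequence[Molecule],
                   test_fraction: float) -> Tuple[List[Molecule], List[Molecule]]:
    """Assign whole scaffold groups to one side so train and test share no scaffold."""
    groups = defaultdict(list)
    for molecule in items:
        groups[molecule[1]].append(molecule)

    # Largest groups are placed in train first (an assumed heuristic);
    # groups that no longer fit the train budget go to test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(items) - int(round(len(items) * test_fraction))

    train: List[Molecule] = []
    test: List[Molecule] = []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test
```

Because the scaffold split keeps each scaffold on one side only, the test set contains chemotypes the model has not seen, which is one way a distribution shift can enter the evaluation.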
AI Aug 28, 2024
Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems
Farzaneh Dehghani, Mahsa Dibaji, Fahim Anzum et al.
Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which, if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black-box nature presents significant ethical challenges related to bias and transparency. AI applications are heavily affected by biases, which produce inconsistent and unreliable findings, lead to significant costs and consequences, and highlight and perpetuate inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.
SE Jun 27, 2023
The Perspective of Software Professionals on Algorithmic Racism
Ronnie de Souza Santos, Luiz Fernando de Lima, Cleyton Magalhaes
Context. Algorithmic racism is the term used to describe the behavior of technological solutions that constrains users based on their ethnicity. Lately, various data-driven software systems have been reported to discriminate against Black people, either through the use of biased data sets or through the prejudice propagated by software professionals in their code. As a result, Black people are experiencing disadvantages in accessing technology-based services, such as housing, banking, and law enforcement. Goal. This study aims to explore algorithmic racism from the perspective of software professionals. Method. A survey questionnaire was applied to explore the understanding of software practitioners on algorithmic racism, and data analysis was conducted using descriptive statistics and coding techniques. Results. We obtained answers from a sample of 73 software professionals discussing their understanding and perspectives on algorithmic racism in software development. Our results demonstrate that the effects of algorithmic racism are well-known among practitioners. However, there is no consensus on how the problem can be effectively addressed in software engineering. In this paper, some solutions to the problem are proposed based on the professionals' narratives. Conclusion. Combining technical and social strategies, including training on structural racism for software professionals, is the most promising way to address the algorithmic racism problem and its effects on the software solutions delivered to our society.
CY Mar 19
LLM Use, Cheating, and Academic Integrity in Software Engineering Education
Ronnie de Souza Santos, Italo Santos, Mariana Bento et al.
Background: Cheating in university education is commonly described as context-dependent and influenced by assessment design, institutional norms, and student interpretation. In software engineering education, programming-oriented coursework has historically involved ambiguity around collaboration, reuse, and external assistance. Recently, large language models (LLMs) have introduced additional mediation in the production of code and related artifacts. Aims: This study investigates how software engineering students describe experiences of using LLMs in ways they perceived as inappropriate, disallowed, or misaligned with course expectations. Method: A cross-sectional survey was conducted with 116 undergraduate software engineering students from multiple countries, combining quantitative summaries with qualitative data. Results: Reported LLM cheating practices occurred primarily in programming assignments, routine coursework, and documentation tasks, often in contexts of time pressure and unclear guidance. Use during quizzes and exams was less frequent and more consistently identified as a violation. Students reported awareness of academic and professional consequences regarding LLM cheating, while formal sanctions were perceived as limited. Conclusions: Our study indicates that reported LLM misuse in software engineering is associated with assessment and instructional conditions, suggesting a need for clearer alignment between assessment design, learning objectives, and expectations for LLM use.
SE Mar 31
Sustainable AI Assistance Through Digital Sobriety
Madeline Jennings, Novarun Deb, Ronnie de Souza Santos
As AI assistants become commonplace in daily life, the demand for solutions that reduce the cost of inference without sacrificing utility is increasing. Existing work on AI sustainability frequently emphasizes hardware and software optimizations; however, there may be comparable value in social approaches that shape user behavior and discourage unnecessary use. In this study, we operationalize sustainability in terms of energy efficiency and analyze and categorize a publicly sourced sample of prompts where AI is used for assistance in software development. Using this categorization, we find that nearly half of the observed queries can be considered unnecessary relative to their expected benefit. We further observe that factoid-style information retrieval constitutes the largest share of unnecessary requests, suggesting that a meaningful portion of everyday AI usage may be replaceable with lower-cost alternatives (e.g., conventional search or local documentation). These findings motivate a closer examination of how, why, and when AI systems are invoked, and what norms or interface-level nudges might reduce avoidable demand. We conclude with a call to replicate and extend this preliminary analysis and to pay greater attention to the social dimension of AI sustainability.
SE Mar 25
Efficiency for Experts, Visibility for Newcomers: A Case Study of Label-Code Alignment in Kubernetes
Matteo Vaccargiu, Sabrina Aufiero, Silvia Bartolucci et al.
Labels on platforms such as GitHub support triage and coordination, yet little is known about how well they align with code modifications or how such alignment affects collaboration across contributor experience levels. We present a case study of the Kubernetes project, introducing label-diff congruence (the alignment between pull request labels and modified files) and examining its prevalence, stability, behavioral validation, and relationship to collaboration outcomes across contributor tiers. We analyse 18,020 pull requests (2014–2025) with area labels and complete file diffs, validate alignment through analysis of over one million review comments and label corrections, and test associations with time-to-merge and discussion characteristics using quantile regression and negative binomial models stratified by contributor experience. Congruence is prevalent (46.6% perfect alignment), stable over years, and routinely maintained (9.2% of PRs corrected during review). It does not predict merge speed but shapes discussion: among core developers (81% of the sample), higher congruence predicts quieter reviews (18% fewer participants), whereas among one-time contributors it predicts more engagement (28% more participants). Label-diff congruence influences how collaboration unfolds during review, supporting efficiency for experienced developers and visibility for newcomers. For projects with similar labeling conventions, monitoring alignment can help detect coordination friction and provide guidance when labels and code diverge.
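One way to picture label-diff congruence is as a set overlap between the area labels on a pull request and the areas implied by its modified files. The sketch below is an assumption made for illustration: the abstract does not state the paper's formula, so a Jaccard score is swapped in, and the prefix-to-label table `AREA_BY_PREFIX` is hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from path prefixes to area labels (not taken from the paper).
AREA_BY_PREFIX: Dict[str, str] = {
    "pkg/kubelet/": "area/kubelet",
    "pkg/kubectl/": "area/kubectl",
    "test/": "area/test",
}


def areas_from_files(paths: Iterable[str]) -> Set[str]:
    """Collect the area labels implied by the modified file paths."""
    areas: Set[str] = set()
    for path in paths:
        for prefix, area in AREA_BY_PREFIX.items():
            if path.startswith(prefix):
                areas.add(area)
    return areas


def congruence(labels: Iterable[str], paths: Iterable[str]) -> float:
    """Jaccard overlap between applied labels and file-implied areas (assumed measure)."""
    label_set = set(labels)
    file_areas = areas_from_files(paths)
    union = label_set | file_areas
    if not union:
        # Nothing to compare on either side; treated as 0.0 in this sketch.
        return 0.0
    return len(label_set & file_areas) / len(union)
```

Under this assumed measure, a score of 1.0 would correspond to perfect alignment, and a pull request labeled `area/kubelet` and `area/test` that only touches `pkg/kubelet/` files would score 0.5.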
SE Mar 29
Large Language Models in Game Development: Implications for Gameplay, Playability, and Player Experience
Keeryn Johnson, Muhammad Ahmed, Charlie Lang et al.
This paper investigates how the integration of large language models influences gameplay, playability, and player experience in game development. We report a collaborative autoethnographic study of two game projects in which LLMs were embedded as architectural components. Reflective narratives and development artifacts were analyzed using gameplay, playability, and player experience as guiding constructs. The findings suggest that LLM integration increases variability and personalization while introducing challenges related to correctness, difficulty calibration, and structural coherence across these concepts. The study provides preliminary empirical insight into how generative AI integration reshapes established game constructs and introduces new architectural and quality considerations within game engineering practice.
SE Dec 8, 2025
A Gray Literature Study on Fairness Requirements in AI-enabled Software Engineering
Thanh Nguyen, Chaima Boufaied, Ronnie de Souza Santos
With the growing drive to apply Artificial Intelligence (AI), particularly Machine Learning (ML), to software across various contexts, much of the focus has been on the effectiveness of AI models, often measured through common metrics such as F1-score, while fairness receives relatively little attention. This paper presents a review of existing gray literature, examining fairness requirements in the AI context, with a focus on how they are defined across various application domains, how they are managed throughout the Software Development Life Cycle (SDLC), and the causes and corresponding consequences of their violation by AI models. Our gray literature investigation shows various definitions of fairness requirements in AI systems, commonly emphasizing non-discrimination and equal treatment across different demographic and social attributes. Fairness requirement management practices vary across the SDLC, particularly in model training and bias mitigation, fairness monitoring and evaluation, and data handling practices. Fairness requirement violations are frequently linked, but not limited, to data representation bias, algorithmic and model design bias, human judgment, and evaluation and transparency gaps. The corresponding consequences include harm in a broad sense (with specific professional and societal impacts as key examples), stereotype reinforcement, data and privacy risks, and loss of trust and legitimacy in AI-supported decisions. These findings emphasize the need for consistent frameworks and practices to integrate fairness into AI software, paying as much attention to fairness as to effectiveness.
SE Mar 29
Fairness Across Fields: Comparing Software Engineering and Human Sciences Perspectives
Lucas Valenca, Ronnie de Souza Santos
Background. As digital technologies increasingly shape social domains such as healthcare, public safety, entertainment, and education, software engineering has engaged with ethical and political concerns primarily through the notion of algorithmic fairness. Aim. This study challenges the limits of software engineering approaches to fairness by analyzing how fairness is conceptualized in the human sciences. Methodology. We conducted two secondary studies, exploring 45 articles on algorithmic fairness in software engineering and 25 articles on fairness from the humanities, and compared their findings to assess cross-disciplinary insights for ethical technological development. Results. The analysis shows that software engineering predominantly defines fairness through formal and statistical notions focused on outcome distribution, whereas the humanities emphasize historically situated perspectives grounded in structural inequalities and power relations, with differences also evident in associated social benefits, proposed practices, and identified challenges. Conclusion. Perspectives from the human sciences can meaningfully contribute to software engineering by promoting situated understandings of fairness that move beyond technical approaches and better account for the societal impacts of technologies.
SE Mar 29
Advancing Evidence-Based Social Sustainability in Software Engineering: A Research Roadmap
Bimpe Ayoola, Anielle Andrade, Ronnie de Souza Santos et al.
Social sustainability in software development means creating and maintaining systems that promote pro-social values (e.g., human well-being, equity), both now and in the future. However, social sustainability lacks clear conceptual and methodological foundations, and often takes a back seat to speed and profit. This paper therefore reports a narrative review of existing definitions of social sustainability in software development and identifies key aspects of social sustainability including social equity, well-being, and community cohesion. Challenges around measuring and integrating social sustainability into practice are conceptually analyzed. The paper then proposes a comprehensive definition of social sustainability and outlines a roadmap for measuring and integrating social sustainability into software engineering processes.
SE Apr 10
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
Corey Yang-Smith, Ronnie de Souza Santos, Ahmad Abdellatif
Transformer-based large language models (LLMs) and multi-agent systems (MAS) are increasingly embedded across the software development lifecycle (SDLC), yet their fairness implications for developer-facing tools remain underexplored despite their growing role in shaping what code is written, reviewed, and released. We present a rapid review of recent work on fairness in MAS, emphasizing LLM-enabled settings and relevance to software engineering. Starting from an initial set of 350 papers, we screened and filtered the corpus for relevance, retaining 18 studies for final analysis. Across these 18 studies, fairness is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, while evaluation spans accuracy metrics on bias benchmarks, demographic disparity measures, and emergent MAS-specific notions such as conformity and bias amplification. Reported harms include representational, quality-of-service, security and privacy, and governance failures, which we relate to SDLC stages where evidence is most and least developed. We identify three persistent gaps: (1) fragmented, rarely MAS-specific evaluation practices that limit comparability, (2) limited generalization due to simplified environments and narrow attribute coverage, and (3) scarce, weakly evaluated mitigation and governance mechanisms aligned to real software workflows. These findings suggest MAS fairness research is not yet ready to support deployable, fairness-assured software systems, motivating MAS-aware benchmarks, consistent protocols, and lifecycle-spanning governance.
SE Apr 28
Supporting Belonging in Software Engineering Through Role Models Exposure
Ronnie de Souza Santos
Role models are widely discussed in educational research as influential in students' identity development and sense of belonging, yet less attention has been given to how role model visibility can be systematically embedded within everyday engineering instruction. This paper presents an analytic autoethnographic account of integrating historically grounded role models into routine software engineering teaching practice. Drawing on reflective memos and instructional artifacts across multiple course offerings, we characterize how brief, topic-aligned contextualizations of pioneers were incorporated into core technical lectures without altering learning objectives or assessments. The findings indicate that this structurally embedded approach functioned as a low-disruption pedagogical practice that aligned representation with disciplinary substance, situating diverse contributors as foundational to the development of software architecture. The integration was iterative and refined across semesters to strengthen topic alignment and instructional flow. These results suggest that embedding historically grounded representation within technical content may serve as a practical mechanism for supporting inclusivity while preserving technical rigor in engineering education.
CL Apr 22
Intersectional Fairness in Large Language Models
Chaima Boufaied, Ronnie De Souza Santos, Ann Barcomb
Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
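The reported pattern in disambiguated contexts, higher accuracy when the correct answer reinforces a stereotype than when it contradicts one, can be expressed as a simple accuracy gap. The sketch below is illustrative only: the `Item` record and the `alignment_gap` function are hypothetical names, and the abstract does not specify the bias scores or subgroup metrics actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    """One disambiguated-context question (hypothetical record layout)."""
    correct: bool             # did the model pick the correct answer?
    stereotype_aligned: bool  # does the correct answer reinforce a stereotype?


def accuracy(items: Iterable[Item]) -> float:
    """Fraction of items answered correctly; NaN when the group is empty."""
    pool: List[Item] = list(items)
    if not pool:
        return float("nan")
    return sum(item.correct for item in pool) / len(pool)


def alignment_gap(items: Iterable[Item]) -> float:
    """Accuracy on stereotype-aligned items minus accuracy on contradicting items."""
    pool = list(items)
    aligned = [item for item in pool if item.stereotype_aligned]
    contradicting = [item for item in pool if not item.stereotype_aligned]
    return accuracy(aligned) - accuracy(contradicting)
```

A positive gap would indicate that apparent competence partly tracks stereotype-consistent cues; computing it per intersectional group and per run would be one way to combine it with the subgroup and consistency analyses the abstract describes.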
CY Apr 6
Teaching Empathy in Software Engineering Education in the Age of Artificial Intelligence
Ronnie de Souza Santos, Cleyton Magalhães, Giuseppe Destefanis et al.
Empathy has been discussed as a relevant human capability in software engineering, particularly in activities that require understanding users, stakeholders, and the societal implications of technological systems. This relevance becomes more pronounced in the context of artificial intelligence, where software increasingly participates in decisions that affect diverse individuals and communities. However, limited guidance exists on how empathy can be integrated into technical software engineering education in ways that connect with the development of AI-enabled systems. This study investigates teaching practices that educators use to incorporate empathy into software engineering courses. Using qualitative analysis of educator-reported practices, we identified five categories through which empathy is operationalized within technical coursework: societal framing of AI systems, fairness and accessibility considerations in design and evaluation, representation of diverse users, stakeholder role awareness and responsibility, and structured reflection and feedback during development processes. The findings indicate that empathy can be embedded within core development activities rather than taught as a separate topic, enabling students to reason about bias, accessibility, accountability, and the societal consequences of AI technologies. These results contribute a structured view of how empathy-oriented practices can be incorporated into software engineering education to support the preparation of students who will develop AI-enabled systems.
SE Mar 23
On the Economic Implications of Diversity in Software Engineering
Sofia Tapias Montana, Ronnie de Souza Santos
This paper investigates how software professionals perceive the economic implications of diversity in software engineering teams. Motivated by a gap in software engineering research, which has largely emphasized socio-technical and process-related outcomes, we adopted a qualitative interview approach to capture practitioners' reasoning about diversity in relation to economic and market-oriented considerations. Based on interviews with ten software professionals, our analysis indicates that diversity is perceived as economically relevant through its associations with cost reduction and containment, revenue generation, time to market, process efficiency, innovation, and market alignment. Participants typically grounded these perceptions in concrete project experiences rather than abstract economic reasoning, framing diversity as a practical resource that supports project delivery, competitiveness, and organizational viability. Our findings provide preliminary empirical insights into how economic aspects of diversity are understood in software engineering practice.
SE Mar 8
Empathy in Software Engineering Education: Evidence, Practices, and Opportunities
Matheus de Morais Leca, Kim Johnston, Ronnie de Souza Santos
Context: Empathy is increasingly recognized as a critical human capability for software engineers, supporting collaboration, ethical awareness, and user-centered design. While many disciplines have long explored empathy as part of professional formation, its incorporation into software engineering education remains fragmented. Aim: This study investigates how empathy has been used, taught, and discussed in general engineering and software engineering education, with the goal of identifying pedagogical practices, outcomes, and disciplinary differences that inform the structured integration of empathy into software curricula. Method: Following established guidelines for systematic reviews in software engineering, we conducted a comprehensive search across six databases and analyzed 43 primary studies published between 2001 and 2025. Data were coded and synthesized using descriptive and thematic analysis to capture how empathy is conceptualized, fostered, and assessed across educational contexts. Findings: Our findings show that engineering programs frame empathy as an ethical and reflective capacity linked to social responsibility, whereas software engineering translates empathy into structured, design-oriented, and measurable practices. Across both domains, empathy teaching enhances collaboration, ethical reasoning, bias awareness, and motivation, but remains limited by low curricular prioritization, measurement challenges, and resource constraints. Conclusion: Empathy is evolving from a peripheral soft skill into a measurable pedagogical construct in software engineering education. Embedding empathy as a continuous, assessable component of design and development courses can strengthen inclusivity, ethical reflection, and responsible innovation in future software professionals.
SE Mar 12
How Fair is Software Fairness Testing?
Ann Barcomb, Mariana Pinheiro Bento, Giuseppe Destefanis et al.
Software fairness testing is a central method for evaluating AI systems, yet the meaning of fairness is often treated as fixed and universally applicable. This vision paper positions fairness testing as culturally situated and examines the problem across three dimensions. First, fairness metrics encode particular cultural values while marginalizing others. Second, test datasets are predominantly designed from Western contexts, excluding knowledge systems grounded in oral traditions, Indigenous languages, and non-digital communities. Third, fairness testing raises ethical concerns, including the reliance on low-paid data labeling in the Global South and the associated environmental costs of training and deploying large-scale models, which disproportionately affect climate-vulnerable populations. Addressing these issues requires rethinking fairness testing beyond universal metrics and moving toward evaluation frameworks that respect cultural plurality and acknowledge the right to refuse algorithmic mediation.
SE Mar 12
Team Diversity Promotes Software Fairness: An Experiment on Fairness-Aware Requirements Prioritization
Cleyton Magalhes, Ronnie de Souza Santos, Bimpe Ayoola et al.
Background: Fairness and diversity are receiving growing attention in software engineering, particularly as AI and machine learning systems increasingly influence decision-making processes. While fairness is often examined at the algorithmic or data level, there is limited understanding of how it is addressed during the early stages of software development. Moreover, little is known about how team diversity affects fairness-related decisions in software projects. Aims: This study investigates how diversity in software teams influences fairness-aware behavior during requirements prioritization. Method: A controlled experiment was conducted with 27 pairs of software engineering students, including 13 LGBTQ-diverse pairs and 14 non-diverse pairs. Each pair prioritized user stories with varying fairness implications. Descriptive statistics were used to analyze attitudes and prioritization outcomes, and thematic analysis was applied to examine the reasoning behind participants' decisions. Results: Both groups demonstrated general alignment with fairness principles, prioritizing features that promoted equitable treatment and rejecting those that posed fairness risks. However, LGBTQ-diverse pairs were more consistent in rejecting fairness-risking stories and made fewer fairness-related misprioritization errors. Their reasoning emphasized inclusion, non-discrimination, and ethical responsibility, whereas non-diverse pairs adopted a more pragmatic, goal-oriented perspective. Conclusions: The findings indicate that fairness should be considered from the earliest stages of software development. Team diversity can enhance the identification and interpretation of fairness issues during requirements analysis, fostering more reflective and inclusive decision-making.
SE Mar 8
Regression Testing in Remote and Hybrid Software Teams: An Exploratory Study of Processes, Tools, and Practices
Juliane Pascoal, Cleytton Magalhaes, Ronnie de Souza Santos
Remote and hybrid work have transformed how software development teams organize, communicate, and assure quality. This study investigates how regression testing is performed and experienced under these distributed conditions. Using qualitative interviews with twenty software professionals from diverse organizations, we analyzed how regression testing processes, tools, and coordination practices adapt to remote and hybrid environments. The results show that while the core phases of regression testing remain stable, their execution increasingly depends on documentation, automation, and tool integration to support asynchronous collaboration. Communication and coordination challenges were mitigated through standardized reporting, shared repositories, and traceability mechanisms that replaced informal co-located interactions. These findings reveal regression testing as a socio-technical practice shaped by the interaction between human collaboration and digital infrastructure. Our study contributes to understanding how software quality assurance evolves under remote conditions and offers practical implications for teams and organizations adopting hybrid work models.
SE Mar 8
The role of team diversity in AI systems development
Ronnie de Souza Santos, Maria Teresa Baldassarre, Cleyton Magalhaes
The widespread integration of AI technologies has intensified concerns about fairness and bias, as these systems often perpetuate societal inequalities through flawed data and design choices. While software engineering research has largely concentrated on technical solutions, such as improving datasets and models, the social dynamics that shape AI outcomes remain underexplored. This study investigates the role of team diversity in the development of AI systems. Drawing from the experience of four AI-focused teams working in a large software company operating in Brazil and Portugal, and collaborating with global clients, the study explores how diverse teams influence the development of AI systems. Using Grounded Theory, we conducted 25 interviews with software professionals involved in projects spanning domains such as education, energy, accessibility, and facial recognition. Although our study is conducted in a single organizational setting, the variety of projects, from regional to multinational, ensures exposure to global development practices and diverse team dynamics, bringing a variety of perspectives into our findings. Our analysis revealed six key roles that team diversity played in AI development: diversifying perspectives for bias identification, bringing empathy to AI development, addressing systemic discrimination, supporting inclusive and participatory decision-making, using diversity as a safeguard against bias, and fostering broadened thinking in problem-solving. These findings highlight the importance of incorporating diverse perspectives in AI projects and offer practical recommendations for integrating fairness considerations into software development practices.