SE Aug 21, 2023 Code
Large Language Models for Software Engineering: A Systematic Literature Review
Xinyi Hou, Yanjie Zhao, Yue Liu et al.
Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we conducted a systematic literature review (SLR) on LLM4SE, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes. We select and analyze 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs). In RQ1, we categorize different LLMs that have been employed in SE tasks, characterizing their distinctive features and uses. In RQ2, we analyze the methods used in data collection, preprocessing, and application, highlighting the role of well-curated datasets for successful LLM4SE implementation. RQ3 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE. Finally, RQ4 examines the specific SE tasks where LLMs have shown success to date, illustrating their practical contributions to the field. From the answers to these RQs, we discuss the current state of the art and trends, identify gaps in existing research, and flag promising areas for future study. Our artifacts are publicly available at https://github.com/xinyi-hou/LLM4SE_SLR.
CR Sep 20, 2022 Code
Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning
Van Nguyen, Trung Le, Chakkrit Tantithamthavorn et al.
Software vulnerabilities are a serious and crucial concern. Typically, in a program or function consisting of hundreds or thousands of source code statements, only a few statements cause the corresponding vulnerabilities. Most current vulnerability labelling is done at the function or program level by experts with the assistance of machine learning tools. Extending this approach to the code statement level is much more costly and time-consuming and remains an open problem. In this paper, we propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function. Inspired by the specific structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables representing the relevance of the source code statements to the corresponding function's vulnerability. We then propose novel clustered spatial contrastive learning in order to further improve the representation learning and the robust selection process of vulnerability-relevant code statements. Experimental results on real-world datasets of 200k+ C/C++ functions show the superiority of our method over other state-of-the-art baselines. In general, our method improves on the baselines by 3% to 14% in the VCP, VCA, and Top-10 ACC measures when running on real-world datasets in an unsupervised setting. Our released source code samples are publicly available at https://github.com/vannguyennd/livuitcl.
CR Sep 19, 2022 Code
Cross Project Software Vulnerability Detection via Domain Adaptation and Max-Margin Principle
Van Nguyen, Trung Le, Chakkrit Tantithamthavorn et al.
Software vulnerabilities (SVs) have become a common, serious and crucial concern due to the ubiquity of computer software. Many machine learning-based approaches have been proposed to solve the software vulnerability detection (SVD) problem. However, there are still two open and significant issues for SVD in terms of i) learning automatic representations to improve the predictive performance of SVD, and ii) tackling the scarcity of labeled vulnerability datasets that conventionally need laborious labeling effort by experts. In this paper, we propose a novel end-to-end approach to tackle these two crucial issues. We first exploit automatic representation learning with deep domain adaptation for software vulnerability detection. We then propose a novel cross-domain kernel classifier leveraging the max-margin principle to significantly improve the transfer learning process of software vulnerabilities from labeled projects into unlabeled ones. The experimental results on real-world software datasets show the superiority of our proposed method over state-of-the-art baselines. In short, on F1-measure, the most important measure in SVD, our method scores 1.83% to 6.25% higher than the second-best method on the datasets used. Our released source code samples are publicly available at https://github.com/vannguyennd/dam2p
SE Mar 6, 2023
Requirements Engineering Framework for Human-centered Artificial Intelligence Software Systems
Khlood Ahmad, Mohamed Abdelrazek, Chetan Arora et al.
[Context] Artificial intelligence (AI) components used in building software solutions have substantially increased in recent years. However, many of these solutions focus on technical aspects and ignore critical human-centered aspects. [Objective] Including human-centered aspects during requirements engineering (RE) when building AI-based software can help achieve more responsible, unbiased, and inclusive AI-based software solutions. [Method] In this paper, we present a new framework developed based on human-centered AI guidelines and a user survey to aid in collecting requirements for human-centered AI-based software. We provide a catalog to elicit these requirements and a conceptual model to present them visually. [Results] The framework is applied to a case study to elicit and model requirements for enhancing the quality of 360-degree videos intended for virtual reality (VR) users. [Conclusion] We found that our proposed approach helped the project team fully understand the human-centered needs of the project to deliver. Furthermore, the framework helped to understand which requirements need to be captured at the initial stages versus later stages in the engineering process of AI-based software.
SE Nov 1, 2023
Model-driven Engineering for Machine Learning Components: A Systematic Literature Review
Hira Naveed, Chetan Arora, Hourieh Khalajzadeh et al.
Context: Machine Learning (ML) has become widely adopted as a component in many modern software applications. Due to the large volumes of data available, organizations increasingly want to leverage their data to extract meaningful insights and enhance business profitability. ML components enable predictive capabilities, anomaly detection, recommendation, accurate image and text processing, and informed decision-making. However, developing systems with ML components is not trivial; it requires time, effort, knowledge, and expertise in ML, data processing, and software engineering. There have been several studies on the use of model-driven engineering (MDE) techniques to address these challenges when developing traditional software and cyber-physical systems. Recently, there has been a growing interest in applying MDE for systems with ML components. Objective: The goal of this study is to further explore the promising intersection of MDE with ML (MDE4ML) through a systematic literature review (SLR). Through this SLR, we wanted to analyze existing studies, including their motivations, MDE solutions, evaluation techniques, key benefits and limitations. Results: We analyzed selected studies with respect to several areas of interest and identified the following: 1) the key motivations behind using MDE4ML; 2) a variety of MDE solutions applied, such as modeling languages, model transformations, tool support, targeted ML aspects, contributions and more; 3) the evaluation techniques and metrics used; and 4) the limitations and directions for future work. We also discuss the gaps in existing literature and provide recommendations for future research. Conclusion: This SLR highlights current trends, gaps and future research directions in the field of MDE4ML, benefiting both researchers and practitioners.
SE Apr 12
Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering
Samuel Ferino, Rashina Hoda, John Grundy et al.
How software developers interact with Artificial Intelligence (AI)-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them. While overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills), underreliance might deprive software developers of potential gains in productivity and quality. Based on twenty-two interviews with software developers on using LLMs for software development, we propose a preliminary reliance-control framework where the level of control can be used as a way to identify AI overreliance and underreliance. We also use it to recommend future research to further explore the different control levels supported by the current and emergent LLM-driven tools. Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies. Our findings can help practitioners, educators, and policymakers promote responsible and effective use of AI tools.
SE Nov 9, 2025
Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective
Samuel Ferino, Rashina Hoda, John Grundy et al.
Background: Large Language Models emerged with the potential of provoking a revolution in software development (e.g., automating processes, workforce transformation). Although studies have started to investigate the perceived impact of LLMs for software development, there is a need for empirical studies to comprehend how to balance forward and backward effects of using LLMs. Objective: We investigated how LLMs impact software development and how to manage the impact from a software developer's perspective. Method: We conducted 22 interviews with software practitioners across 3 rounds of data collection and analysis, between October 2024 and September 2025. We employed socio-technical grounded theory (STGT) to rigorously analyse interview participants' responses. Results: We identified the benefits (e.g., maintain software development flow, improve developers' mental model, and foster entrepreneurship) and disadvantages (e.g., negative impact on developers' personality and damage to developers' reputation) of using LLMs at individual, team, organisation, and society levels, as well as best practices on how to adopt LLMs. Conclusion: Critically, we present the trade-offs that software practitioners, teams, and organisations face in working with LLMs. Our findings are particularly useful for software team leaders and IT managers to assess the viability of LLMs within their specific context.
SE May 6
Engineering for Crisis Management: A User-Centred Analysis of Disaster Mobile Applications
Muhamad Syukron, Anuradha Madugalla, Mojtaba Shahin et al.
Disaster mobile apps play an increasingly important role in disseminating hazard information and supporting communities during emergency situations. This study presents a comprehensive analysis of these mobile applications, focusing on their features, user-reported challenges, and opportunities for improvement. We first examined the landscape of disaster mobile apps by analysing 70 apps identified through a combination of methods, including those from the literature, the Google Play Store, and the App Store. The analysis categorised apps based on disaster focus, geographic coverage, popularity, monetisation strategies, and features across the disaster lifecycle. We then extracted, translated and analysed user reviews using topic modelling and sentiment analysis to identify key concerns and recurring issues. The results show that most applications prioritise response-related functionalities, with limited support for preparedness and recovery. User feedback highlights critical challenges related to technical reliability, usability, accessibility, and information clarity. Based on these findings, we propose a set of recommendations for developers and emergency management agencies to improve the reliability, inclusiveness, and overall effectiveness of disaster mobile apps. These include adopting lifecycle-oriented design approaches, strengthening multilingual support, improving technical robustness, and integrating user feedback into development processes. This work contributes to the growing body of research on human-centred disaster risk reduction by providing empirical insights and actionable guidance for the design of more reliable and inclusive disaster communication systems.
SE Mar 10, 2025 Code
Novice Developers' Perspectives on Adopting LLMs for Software Development: A Systematic Literature Review
Samuel Ferino, Rashina Hoda, John Grundy et al.
Following the rise of large language models (LLMs), many studies have emerged in recent years focusing on exploring the adoption of LLM-based tools for software development by novice developers: computer science/software engineering students and early-career industry developers with two years or less of professional experience. These studies have sought to understand the perspectives of novice developers on using these tools, a critical aspect of the successful adoption of LLMs in software engineering. To systematically collect and summarise these studies, we conducted a systematic literature review (SLR) following the guidelines by Kitchenham et al. on 80 primary studies published between April 2022 and June 2025 to answer four research questions (RQs). In answering RQ1, we categorised the study motivations and methodological approaches. In RQ2, we identified the software development tasks for which novice developers use LLMs. In RQ3, we categorised the advantages, challenges, and recommendations discussed in the studies. Finally, we discuss the study limitations and future research needs suggested in the primary studies in answering RQ4. Throughout the paper, we also indicate directions for future work and implications for software engineering researchers, educators, and developers. Our research artifacts are publicly available at https://github.com/Samuellucas97/SupplementaryInfoPackage-SLR.
SE Mar 31
Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health
Yuqing Xiao, John Grundy, Anuradha Madugalla et al.
Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.
CR Apr 24
SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
Chenxi Gu, Xiaoning Du, John Grundy
Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
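As a rough illustration of what a "sort-then-split by groups" partition could look like, the sketch below sorts token ids by logit, walks the sorted list in small consecutive groups, and randomly sends half of each group to each subset, so both subsets receive tokens of similar logit. This is one plausible reading of the method's name, not the authors' algorithm: the function name `sort_then_split`, the `group_size` parameter, the seeding, and the "green"/"red" labels (borrowed from common watermarking terminology) are all assumptions made for illustration.

```python
import random


def sort_then_split(logits, group_size=2, seed=0):
    """Split token ids into two subsets with similar logit mass.

    Hypothetical sketch: sort ids by logit (descending), then split each
    consecutive group of `group_size` ids randomly between the subsets.
    """
    rng = random.Random(seed)  # the seed stands in for a secret watermark key
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    green, red = [], []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        half = (len(group) + 1) // 2  # an odd leftover token goes to green
        green.extend(group[:half])
        red.extend(group[half:])
    return green, red


if __name__ == "__main__":
    # Toy next-token logits for a six-token vocabulary (made-up values).
    toy_logits = [5.0, 4.9, 1.0, 0.9, -2.0, -2.1]
    green_ids, red_ids = sort_then_split(toy_logits, seed=1)
    print("green:", green_ids, "red:", red_ids)
```

With a purely random partition, the few high-logit tokens of a low-entropy prediction can all land in one subset; pairing neighbours in the sorted order prevents that in this toy setting, which is the intuition the abstract gives for raising the lower bound of watermark strength.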
SE Sep 6, 2020 Code
DEFECTCHECKER: Automated Smart Contract Defect Detection by Analyzing EVM Bytecode
Jiachi Chen, Xin Xia, David Lo et al.
Smart contracts are Turing-complete programs running on the blockchain. They are immutable and cannot be modified, even when bugs are detected. Therefore, ensuring smart contracts are bug-free and well-designed before deploying them to the blockchain is extremely important. A contract defect is an error, flaw or fault in a smart contract that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. Detecting and removing contract defects can avoid potential bugs and make programs more robust. Our previous work defined 20 contract defects for smart contracts and divided them into five impact levels. According to our classification, contract defects with a seriousness level of 1 to 3 can lead to unwanted behaviors, e.g., a contract being controlled by attackers. In this paper, we propose DefectChecker, a symbolic execution-based approach and tool to detect eight contract defects that can cause unwanted behaviors of smart contracts on the Ethereum blockchain platform. DefectChecker can detect contract defects from smart contract bytecode. We compare DefectChecker with key previous works, including Oyente, Mythril and Securify, using an open-source dataset. Our experimental results show that DefectChecker performs much better than these tools in terms of both speed and accuracy. We also applied DefectChecker to 165,621 distinct smart contracts on the Ethereum platform. We found that 25,815 of these smart contracts contain at least one contract defect belonging to impact levels 1 to 3, including some real-world attacks.
SE Feb 3, 2018 Code
A deep tree-based model for software defect prediction
Hoa Khanh Dam, Trang Pham, Shien Wee Ng et al.
Defects are common in software systems and can potentially cause various problems to software users. Different methods have been developed to quickly predict the most likely locations of defects in large code bases. Most of them focus on designing features (e.g. complexity metrics) that correlate with potentially defective code. Those approaches, however, do not sufficiently capture the syntax and different levels of semantics of source code, an important capability for building accurate prediction models. In this paper, we develop a novel prediction model which is capable of automatically learning features for representing source code and using them for defect prediction. Our prediction system is built upon a powerful deep learning model, the tree-structured Long Short-Term Memory network, which directly matches the Abstract Syntax Tree representation of source code. An evaluation on two datasets, one from open source projects contributed by Samsung and the other from the public PROMISE repository, demonstrates the effectiveness of our approach for both within-project and cross-project predictions.
SE Dec 23, 2024
RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation
Yanli Wang, Yanlin Wang, Suiquan Wang et al.
Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-, function-, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.
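The abstract glosses Success@1 only as "test success in one attempt". A minimal sketch of how such a score could be computed is shown below, assuming it is the fraction of repositories whose first translation passes the whole test suite; the function name `success_at_1` and the sample outcomes are hypothetical, and the benchmark's exact definition may differ.

```python
def success_at_1(first_attempt_passed):
    """Fraction of repositories whose first translation passed its tests.

    `first_attempt_passed` holds one boolean per repository. This is an
    assumed reading of Success@1, not the benchmark's official scorer.
    """
    if not first_attempt_passed:
        raise ValueError("need at least one repository result")
    return sum(first_attempt_passed) / len(first_attempt_passed)


if __name__ == "__main__":
    # Made-up outcomes for five repositories (True = all tests passed).
    outcomes = [True, False, False, False, False]
    print(f"Success@1 = {success_at_1(outcomes):.2%}")
```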
CY Jan 8, 2025
Ethical Concerns of Generative AI and Mitigation Strategies: A Systematic Mapping Study
Yutan Huang, Chetan Arora, Wen Cheng Houng et al.
[Context] Generative AI technologies, particularly Large Language Models (LLMs), have transformed numerous domains by enhancing convenience and efficiency in information retrieval, content generation, and decision-making processes. However, deploying LLMs also presents diverse ethical challenges, and their mitigation strategies remain complex and domain-dependent. [Objective] This paper aims to identify and categorize the key ethical concerns associated with using LLMs, examine existing mitigation strategies, and assess the outstanding challenges in implementing these strategies across various domains. [Method] We conducted a systematic mapping study, reviewing 39 studies that discuss ethical concerns and mitigation strategies related to LLMs. We analyzed these ethical concerns using five ethical dimensions that we extracted based on various existing guidelines, frameworks, and an analysis of the mitigation strategies and implementation challenges. [Results] Our findings reveal that ethical concerns in LLMs are multi-dimensional and context-dependent. While proposed mitigation strategies address some of these concerns, significant challenges remain. [Conclusion] Our results highlight that ethical issues often hinder the practical implementation of the mitigation strategies, particularly in high-stakes areas like healthcare and public governance; existing frameworks often lack adaptability, failing to accommodate evolving societal expectations and diverse contexts.
SE Apr 16
Requirements Perception Gap across Stakeholders: A Comparative Survey of Aged Care Digital Health Software
Yuqing Xiao, John Grundy, Anuradha Madugalla et al.
We sought to explore and compare the perspectives of three key stakeholder groups: older adults, caregivers (formal health providers and informal caregivers), and digital health software developers on key functional and non-functional requirements. We conducted a survey, designed based on the findings from an existing systematic review, to gather and analyse data related to the three stakeholder groups' (dis)satisfaction with current aged care digital health software and their views on key future aged care software requirements. A mixed-methods survey approach integrated quantitative questionnaire data and qualitative open-ended responses from a total sample of 249, comprised of older adults (103), formal and informal caregivers (41), and software developers (105). Data analysis utilised a mixed methods approach, employing inferential statistics to compare group satisfaction levels and thematic analysis for qualitative open-ended responses. Our analysis reveals a significant "Requirements Gap". Software developers tend to prioritise advanced features and functional requirements, significantly overestimating user satisfaction with core NFRs such as ease of use and responsiveness. Conversely, developers were more critical of existing functional features compared to older adults and caregivers, who prioritised simplicity and reliability over feature density. By combining quantitative and qualitative analysis, we identified where stakeholder priorities align and where they diverge across functional and non-functional requirements in both the current designs they used and the future designs they desire. Our findings present a stakeholder gap analysis that can guide future co-design processes, near-term product decisions, and privacy-by-design recommendations in aged care digital health.
AI Jan 14, 2025
Advice for Diabetes Self-Management by ChatGPT Models: Challenges and Recommendations
Waqar Hussain, John Grundy
Given their advanced reasoning, extensive contextual understanding, and robust question-answering abilities, large language models have become prominent in healthcare management research. Despite adeptly handling a broad spectrum of healthcare inquiries, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models' limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often provide advice without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and the incorporation of disease-specific external memory using an advanced Retrieval Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.
SE Sep 17, 2025
Monitoring Machine Learning Systems: A Multivocal Literature Review
Hira Naveed, Scott Barnett, Chetan Arora et al.
Context: Dynamic production environments make it challenging to maintain reliable machine learning (ML) systems. Runtime issues, such as changes in data patterns or operating contexts, that degrade model performance are a common occurrence in production settings. Monitoring enables early detection and mitigation of these runtime issues, helping maintain users' trust and prevent unwanted consequences for organizations. Aim: This study aims to provide a comprehensive overview of the ML monitoring literature. Method: We conducted a multivocal literature review (MLR) following the well-established guidelines by Garousi to investigate various aspects of ML monitoring approaches in 136 papers. Results: We analyzed selected studies based on four key areas: (1) the motivations, goals, and context; (2) the monitored aspects, specific techniques, metrics, and tools; (3) the contributions and benefits; and (4) the current limitations. We also discuss several insights found in the studies, their implications, and recommendations for future research and practice. Conclusion: Our MLR identifies and summarizes ML monitoring practices and gaps, emphasizing similarities and disconnects between formal and gray literature. Our study is valuable for both academics and practitioners, as it helps select appropriate solutions, highlights limitations in current approaches, and provides future directions for research and tool development.
SE Nov 6, 2024
Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review
Yuqing Xiao, John Grundy, Anuradha Madugalla
Growth of the older adult population has led to an increasing interest in technology-supported aged care. However, the area has some challenges such as a lack of caregivers and limitations in understanding the emotional, social, physical, and mental well-being needs of seniors. Furthermore, there is a gap between developers' understanding of requirements and that of ageing people. Digital health can be important in supporting older adults' wellbeing, emotional requirements, and social needs. Requirements Engineering (RE) is a major software engineering field, which can help to identify, elicit and prioritize the requirements of stakeholders and ensure that the systems meet standards for performance, reliability, and usability. We carried out a systematic review of the literature on RE for older adult digital health software. This was necessary to show the current stage of understanding of the needs of older adults in aged care digital health. Using established guidelines outlined by the Kitchenham method, the PRISMA and the PICO guideline, we developed a protocol, followed by the systematic exploration of eight databases. This resulted in 69 primary studies of high relevance, which were subsequently subjected to data extraction, synthesis, and reporting. We highlight key RE processes in digital health software for ageing people. The review explored the utilization of technology for older user well-being and care, and the evaluations of such solutions. The review also identified key limitations found in existing primary studies that inspire future research opportunities. The results indicate that requirement gathering and understanding vary significantly between studies. The differences are in the quality, depth, and techniques adopted for requirement gathering, and these differences are largely due to uneven adoption of RE methods.
SE Oct 13, 2025
Generative AI for Software Project Management: Insights from a Review of Software Practitioner Literature
Lakshana Iruni Assalaarachchi, Zainab Masood, Rashina Hoda et al.
Software practitioners are discussing GenAI transformations in software project management openly and widely. To understand the state of affairs, we performed a grey literature review using 47 publicly available practitioner sources including blogs, articles, and industry reports. We found that software project managers primarily perceive GenAI as an "assistant", "copilot", or "friend" rather than as a "PM replacement", with support of GenAI in automating routine tasks, predictive analytics, communication and collaboration, and in agile practices leading to project success. Practitioners emphasize responsible GenAI usage given concerns such as hallucinations, ethics and privacy, and lack of emotional intelligence and human judgment. We present upskilling requirements for software project managers in the GenAI era mapped to the Project Management Institute's talent triangle. We share key recommendations for both practitioners and researchers.
SE Oct 9, 2025
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
Terry Yue Zhuo, Xiaolong Jin, Hange Liu et al.
Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.
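The abstract above mentions an automatic Elo rating benchmark. As a rough illustration of how Elo ratings can be derived from pairwise comparisons, here is a minimal sketch of the textbook Elo update. It is not taken from the paper: the function names, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for illustration, and AutoCodeArena may compute its ratings differently.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    k (assumed value) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's update mirrors A's, so the sum of the two ratings is preserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_models(comparisons: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Process (model_a, model_b, score_a) records in order (hypothetical format)."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Sequential Elo depends on the order in which comparisons are processed, which is one reason some leaderboards fit a Bradley-Terry model over all comparisons at once instead.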
SE Aug 20, 2025
Understanding Practitioners Perspectives on Monitoring Machine Learning Systems
Hira Naveed, John Grundy, Chetan Arora et al.
Given the inherent non-deterministic nature of machine learning (ML) systems, their behavior in production environments can lead to unforeseen and potentially dangerous outcomes. For a timely detection of unwanted behavior and to prevent organizations from financial and reputational damage, monitoring these systems is essential. This paper explores the strategies, challenges, and improvement opportunities for monitoring ML systems from the practitioners' perspective. We conducted a global survey of 91 ML practitioners to collect diverse insights into current monitoring practices for ML systems. We aim to complement existing research through our qualitative and quantitative analyses, focusing on prevalent runtime issues, industrial monitoring and mitigation practices, key challenges, and desired enhancements in future monitoring tools. Our findings reveal that practitioners frequently struggle with runtime issues related to declining model performance, exceeding latency, and security violations. While most prefer automated monitoring for its increased efficiency, many still rely on manual approaches due to the complexity or lack of appropriate automation solutions. Practitioners report that the initial setup and configuration of monitoring tools is often complicated and challenging, particularly when integrating with ML systems and setting alert thresholds. Moreover, practitioners find that monitoring adds extra workload, strains resources, and causes alert fatigue. The desired improvements from the practitioners' perspective are: automated generation and deployment of monitors, improved support for performance and fairness monitoring, and recommendations for resolving runtime issues. These insights offer valuable guidance for the future development of ML monitoring tools that are better aligned with practitioners' needs.
CR Jan 17, 2022
Characterizing Sensor Leaks in Android Apps
Xiaoyu Sun, Xiao Chen, Kui Liu et al.
While extremely valuable for achieving advanced functions, mobile phone sensors can be abused by attackers to implement malicious activities in Android apps, as experimentally demonstrated by many state-of-the-art studies. There is hence a strong need to regulate the usage of mobile sensors so as to keep them from being exploited by malicious attackers. However, despite the fact that various efforts have been put into achieving this, i.e., detecting privacy leaks in Android apps, we have not yet found approaches to automatically detect sensor leaks in Android apps. To fill the gap, we designed and implemented a novel prototype tool, SEEKER, that extends the famous FlowDroid tool to detect sensor-based data leaks in Android apps. SEEKER conducts sensor-focused static taint analyses directly on the Android apps' bytecode and reports not only sensor-triggered privacy leaks but also the sensor types involved in the leaks. Experimental results using over 40,000 real-world Android apps show that SEEKER is effective in detecting sensor leaks in Android apps, and that malicious apps are more interested in leaking sensor data than benign apps.
SE Jan 15, 2022
How are Diverse End-user Human-centric Issues Discussed on GitHub?
Hourieh Khalajzadeh, Mojtaba Shahin, Humphrey O. Obie et al.
Many software systems fail to meet the needs of the diverse end-users in society and are prone to pose problems, such as accessibility and usability issues. Some of these problems (partially) stem from the failure to consider the characteristics, limitations, and abilities of diverse end-users during software development. We refer to this class of problems as human-centric issues. Despite their importance, there is a limited understanding of the types of human-centric issues encountered by developers. In-depth knowledge of these human-centric issues is needed to design software systems that better meet their diverse end-users' needs. This paper aims to provide insights for the software development and research communities on which human-centric issues are a topic of discussion for developers on GitHub. We conducted an empirical study by extracting and manually analysing 1,691 issue comments from 12 diverse projects, ranging from small to large-scale projects, including projects designed for challenged end-users, e.g., visually impaired and dyslexic users. Our analysis shows that eight categories of human-centric issues are discussed by developers. These include Inclusiveness, Privacy & Security, Compatibility, Location & Language, Preference, Satisfaction, Emotional Aspects, and Accessibility. Guided by our findings, we highlight some implications and possible future paths to further understand and incorporate human-centric issues in software development to be able to design software that meets the needs of diverse end users in society.
SE Oct 5, 2021
Does Domain Change the Opinion of Individuals on Human Values? A Preliminary Investigation on eHealth Apps End-users
Humphrey Obie, Mojtaba Shahin, John Grundy et al.
The elicitation of end-users' human values - such as freedom, honesty, transparency, etc. - is important in the development of software systems. We carried out two preliminary Q-studies to understand (a) the general human value opinion types of eHealth applications (apps) end-users, (b) the eHealth domain human value opinion types of eHealth apps end-users, and (c) whether there are differences between the general and eHealth domain opinion types. Our early results show three value opinion types using generic value instruments: (1) fun-loving, success-driven and independent end-user, (2) security-conscious, socially-concerned, and success-driven end-user, and (3) benevolent, success-driven, and conformist end-user. Our results also show two value opinion types using domain-specific value instruments: (1) security-conscious, reputable, and honest end-user, and (2) success-driven, reputable and pain-avoiding end-user. Given these results, consideration should be given to domain context in the design and application of values elicitation instruments.
SE Sep 24, 2021
A Model-Driven Approach to Reengineering Processes in Cloud Computing
Mahdi Fahmideh, John Grundy, Ghassan Beydoun et al.
The reengineering process of large data-intensive legacy software applications to cloud platforms involves different interrelated activities. These activities are related to planning, architecture design, re-hosting/lift-shift, code refactoring, and other related ones. In this regard, the cloud computing literature has seen the emergence of different methods with a disparate point of view of the same underlying legacy application reengineering process to cloud platforms. As such, the effective interoperability and tailoring of these methods become problematic due to the lack of integrated and consistent standard models.
HC Sep 20, 2021
Latexify Math: Mathematical Formula Markup Revision to Assist Collaborative Editing in Math Q&A Sites
Suyu Ma, Chunyang Chen, Hourieh Khalajzadeh et al.
Collaborative editing of questions and answers plays an important role in quality control of Mathematics Stack Exchange, a math Q&A site. Our study of post edits in Mathematics Stack Exchange shows that there is a large number of math-related edits about latexifying formulas, revising LaTeX and converting blurred math formula screenshots to LaTeX sequences. Despite its importance, manually editing a math-related post, especially one with complex mathematical formulas, is time-consuming and error-prone even for experienced users. To assist post owners and editors in this editing, we have developed an edit-assistance tool, MathLatexEdit, for formula latexification, LaTeX revision and screenshot transcription. We formulate this formula editing task as a translation problem, in which an original post is translated to a revised post. MathLatexEdit implements a deep learning based approach including two encoder-decoder models for textual and visual LaTeX edit recommendation with math-specific inference. The two models are trained on large-scale historical original-edited post pairs and synthesized screenshot-formula pairs. Our evaluation of MathLatexEdit demonstrates not only the accuracy of our model, but also the usefulness of MathLatexEdit in editing real-world posts, with its edits accepted in Mathematics Stack Exchange.
SE Sep 16, 2021
The Effects of Human Aspects on the Requirements Engineering Process: A Systematic Literature Review
Dulaji Hidellaarachchi, John Grundy, Rashina Hoda et al.
Requirements Engineering (RE) requires the collaboration of various roles in software engineering (SE), such as requirements engineers, stakeholders and other developers, and it is thus a highly human-dependent process in SE. Identifying how human aspects such as personality, motivation, emotions, communication, gender, culture and geographic distribution might impact RE would assist us in better supporting successful RE. The main objective of this paper is to systematically review primary studies that have investigated the effects of various human aspects on RE. A systematic literature review (SLR) was conducted and identified 474 initial primary research studies. These were eventually filtered down to 74 relevant, high-quality primary studies. Among the studies, the effects of communication have been considered in many RE studies. Other human aspects such as personality, motivation and gender have mainly been investigated to date in SE studies that include RE as one phase. Findings show that studying more than one human aspect together is beneficial, as this reveals relationships between various human aspects and how they together impact the RE process. However, the majority of these studied combinations of human aspects are unique. Of the 56.8% of studies that identified the effects of human aspects on RE, 40.5% identified a positive impact, 30.9% a negative impact, and 26.2% both impacts, whereas 2.3% mentioned that there was no impact. This implies that a variety of human aspects positively or negatively affect the RE process, and that a well-defined theoretical analysis of the effects of different human aspects on RE remains to be defined and practically evaluated. Findings of this SLR help researchers who are investigating the impact of various human aspects on RE by identifying well-studied research areas, and highlight new areas that should be focused on in future research.
SE Sep 16, 2021
The Influence of Human Aspects on Requirements Engineering-related Activities: Software Practitioners Perspective
Dulaji Hidellaarachchi, John Grundy, Rashina Hoda et al.
Requirements Engineering (RE)-related activities require high collaboration between various roles in software engineering (SE), such as requirements engineers, stakeholders, developers, etc. Their demographics, views, understanding of technologies, working styles, communication and collaboration capabilities make RE highly human-dependent. Identifying how "human aspects" such as motivation, domain knowledge, communication skills, personality, emotions, culture, etc. might impact RE-related activities would help us improve RE and SE in general. This study aims to better understand current industry perspectives on the influence of human aspects on RE-related activities, specifically focusing on motivation and personality by targeting software practitioners involved in RE-related activities. Our findings indicate that software practitioners consider motivation, domain knowledge, attitude, communication skills and personality as highly important human aspects when involved in RE-related activities. We identified a set of factors that motivate software practitioners when involved in RE-related activities, as well as important personality characteristics to have when involved in RE. We also identified factors that made individuals less effective when involved in RE-related activities and obtained an initial idea of how to measure individuals' performance when involved in RE. The findings from our study suggest various areas needing more investigation, and we summarise a set of key recommendations for further research.
SE Sep 9, 2021
The Emotional Roller Coaster of Responding to Requirements Changes in Software Engineering
Kashumi Madampe, Rashina Hoda, John Grundy
Background: A preliminary study we conducted showed that software practitioners respond to requirements changes (RCs) with different emotions, and that their emotions vary at stages of the RC handling life cycle, such as receiving, developing, and delivering RCs. Objective: We wanted to study more comprehensively how practitioners emotionally respond to RCs. Method: We conducted a world-wide survey with the participation of 201 software practitioners. In our survey, we used the Job-related Affective Well-being Scale (JAWS) and open-ended questions to capture participants' emotions when handling RCs in their work and to query the different circumstances in which they feel these emotions. We used a combined approach of statistical analysis, JAWS, and Socio-Technical Grounded Theory (STGT) for Data Analysis to analyse our survey data. Findings: We identified (1) emotional responses to RCs, i.e., the most common emotions felt by practitioners when handling RCs; (2) different stimuli -- such as the RC, the practitioner, team, manager, customer -- that trigger these emotions through their own different characteristics; (3) emotion dynamics, i.e., the changes in emotions during the project and RC handling life cycles; (4) distinct events where particular emotions are triggered: project milestones and RC stages; and (5) time-related matters that regulate the emotion dynamics. Conclusion: Practitioners are not pleased with receiving RCs all the time. Last-minute RCs introduced closer to a deadline especially violate the emotional well-being of practitioners. We present some practical recommendations for practitioners to follow, including a dual-purpose emotion-centric decision guide to help decide when to introduce or accept an RC, and some future key research directions.
SE Aug 12, 2021
Automating the Removal of Obsolete TODO Comments
Zhipeng Gao, Xin Xia, David Lo et al.
TODO comments are very widely used by software developers to describe their pending tasks during software development. However, after performing the task, developers sometimes neglect or simply forget to remove the TODO comment, resulting in obsolete TODO comments. These obsolete TODO comments can confuse development teams and may cause the introduction of bugs in the future, decreasing the software's quality and maintainability. In this work, we propose a novel model, named TDCleaner (TODO comment Cleaner), to identify obsolete TODO comments in software projects. TDCleaner can assist developers in just-in-time checking of TODO comments' status and avoid leaving obsolete TODO comments. Our approach has two main stages: offline learning and online prediction. During offline learning, we first automatically establish <code_change, todo_comment, commit_msg> training samples and leverage three neural encoders to capture the semantic features of the TODO comment, code change and commit message respectively. TDCleaner then automatically learns the correlations and interactions between different encoders to estimate the final status of the TODO comment. For online prediction, we check a TODO comment's status by leveraging the offline trained model to judge the TODO comment's likelihood of being obsolete. We built our dataset by collecting TODO comments from the top-10,000 Python and Java GitHub repositories and evaluated TDCleaner on them. Extensive experimental results show the promising performance of our model over a set of benchmarks. We also performed an in-the-wild evaluation with real-world software projects: we reported 18 obsolete TODO comments identified by TDCleaner to GitHub developers, and 9 of them have already been confirmed and removed by the developers, demonstrating the practical usage of our approach.
SE Aug 12, 2021
Operationalizing Human Values in Software Engineering: A Survey
Mojtaba Shahin, Waqar Hussain, Arif Nurwidyantoro et al.
Human values (e.g., pleasure, privacy, and social justice) are what a person or a society considers important. The inability to address them in software-intensive systems can result in numerous undesired consequences (e.g., financial losses) for individuals and communities. Various solutions (e.g., methodologies, techniques) have been developed to help "operationalize values in software". The ultimate goal is to ensure that the software built (better) reflects and respects human values. In this survey, "operationalizing values" refers to the process of identifying human values and translating them to accessible and concrete concepts so that they can be implemented, validated, verified, and measured in software. This paper provides a deep understanding of the research landscape on operationalizing values in software engineering, covering 51 primary studies. It also presents an analysis and taxonomy of 51 solutions for operationalizing values in software engineering. Our survey reveals that most solutions attempt to help operationalize values in the early phases (requirements and design) of the software development life cycle. However, the later phases (implementation and testing) and other aspects of software development (e.g., "team organization") still need adequate consideration. We outline implications for research and practice and identify open issues and future research directions to advance this area.
SE May 5, 2021
Emotimonitor: A Trello Power-Up to Capture Emotions of Agile Teams
Mohammed-Amr Abd El-Migid, Damon Cai, Thomas Niven et al.
In recent years, Agile methods have continued to grow into a popular means of modulating team productivity, even garnering a presence in non-software development related industries. The uptake of Agile methods has been driven by their flexibility, making them more suitable for many teams when compared to traditional approaches. However, an inevitable expectation for an Agile workflow is a higher level of change and uncertainty regarding requirements and tasks, which can ultimately have impacts on team member emotional states. The extent of such emotion impacts has motivated our research into the manner in which emotional states evolve in an Agile setting, along with whether such emotions can be accurately measured. To this end, we have developed Emotimonitor, a Trello power-up designed to capture information on emotions of team members as they relate to their technical tasks through a user-friendly interface. Emotimonitor will better enable team members to express their emotional states through emoji reactions on Trello cards, while also providing team leaders with a dashboard summarising these reactions as visualisations and statistical data. It is extensible and potentially provides an outlet for team members operating in Agile environments to better express their emotional states.
SE May 5, 2021
Engineering Blockchain Based Software Systems: Foundations, Survey, and Future Directions
Mahdi Fahmideh, John Grundy, Aakash Ahmed et al.
Many scientific and practical areas have shown increasing interest in reaping the benefits of blockchain technology to empower software systems. However, the unique characteristics and requirements associated with Blockchain Based Software (BBS) systems raise new challenges across the development lifecycle that entail an extensive improvement of conventional software engineering. This article presents a systematic literature review of the state-of-the-art in BBS engineering research from a software engineering perspective. We characterize BBS engineering from the theoretical foundations, processes, models, and roles and discuss a rich repertoire of key development activities, principles, challenges, and techniques. The focus and depth of this survey not only gives software engineering practitioners and researchers a consolidated body of knowledge about current BBS development but also underpins a starting point for further research in this field.
SE Apr 3, 2021
Human-Centric Issues in eHealth App Development and Usage: A Preliminary Assessment
Md. Shamsujjoha, John Grundy, Li Li et al.
Health-related mobile applications are known as eHealth apps. These apps make people more aware of their health, help during critical situations, provide home-based disease management, and monitor/support personalized care through sensing/interaction. eHealth app usage is rapidly increasing with a large number of new apps being developed. Unfortunately, many eHealth apps do not successfully adopt Human-Centric Issues (HCI) in the app development process and its deployment stages, leading them to become ineffective and not inclusive of diverse end-users. This paper provides an initial assessment of key human factors related to eHealth apps by literature review, existing guidelines analysis, and user studies. Preliminary results suggest that Usability, Accessibility, Reliability, Versatility, and User Experience are essential HCIs for eHealth apps, and need further attention from researchers and practitioners. Therefore, outcomes of this research will look to amend support for users, developers, and stakeholders of eHealth apps in the form of improved actionable guidelines, best practice examples, and evaluation techniques. The research also aims to trial the proposed solutions on real-world projects.
SE Mar 22, 2021
Checking App Behavior Against App Descriptions: What If There are No App Descriptions?
Md. Shamsujjoha, John Grundy, Li Li et al.
Classifying mobile apps based on their description is beneficial for several purposes. However, many app descriptions do not reflect app functionalities, whether accidentally or on purpose. Most importantly, these app classification methods do not work if the app description is unavailable. This paper investigates a Reverse Engineering-based Approach to Classify mobile apps using The data that exists in the app, called REACT. To validate the proposed REACT method, we use a large set of Android apps (24,652 apps in total). We also show REACT's extensibility for malware/anomaly detection and prove its reliability and scalability. However, our analysis shows some limitations in the REACT procedure and implementation, especially for similar-feature-based app grouping. We discuss the root cause of these failures, our key lessons learned, and some future enhancement ideas. We also share our REACT tools and reproduced datasets with app market analysts, mobile app developers and the software engineering research community for further research purposes.
SE Mar 16, 2021
Accessibility in Software Practice: A Practitioner's Perspective
Tingting Bi, Xin Xia, David Lo et al.
Being able to access software in daily life is vital for everyone, and thus accessibility is a fundamental challenge for software development. However, given the number of accessibility issues reported by many users, e.g., in app reviews, it is not clear if accessibility is widely integrated into current software projects and how software projects address accessibility issues. In this paper, we report a study of the critical challenges and benefits of incorporating accessibility into software development and design. We applied a mixed qualitative and quantitative approach for gathering data from 15 interviews and 365 survey respondents from 26 countries across five continents to understand how practitioners perceive accessibility development and design in practice. We obtained 44 statements grouped into eight topics on accessibility from practitioners' viewpoints and different software development stages. Our statistical analysis reveals substantial gaps between groups, e.g., between practitioners with Direct vs. Indirect accessibility-relevant work experience, when they reviewed the summarized statements. These gaps might hinder the quality of accessibility development and design, and we use our findings to establish a set of guidelines to help practitioners be aware of accessibility challenges and benefit factors. We also propose some remedies to resolve the gaps and highlight key future research directions.
SE Mar 12, 2021
Wireframe-Based UI Design Search Through Image Autoencoder
Jieshan Chen, Chunyang Chen, Zhenchang Xing et al.
UI design is an integral part of software development. For many developers who do not have much UI design experience, exposing them to a large database of real-application UI designs can help them quickly build up a realistic understanding of the design space for a software feature and get design inspiration from existing applications. However, existing keyword-based, image-similarity-based, and component-matching-based methods cannot reliably find relevant high-fidelity UI designs in a large database that are similar to the UI wireframe that the developers sketch, in the face of the great variations in UI designs. In this article, we propose a deep-learning-based UI design search engine to fill the gap. The key innovation of our search engine is to train a wireframe image autoencoder using a large database of real-application UI designs, without the need for labeling relevant UI designs. We implement our approach for Android UI design search, and conduct extensive experiments with artificially created relevant UI designs and human evaluation of UI design search results. Our experiments confirm the superior performance of our search engine over existing image-similarity or component-matching-based methods and demonstrate the usefulness of our search engine in real-world UI design tasks.
SE Feb 24, 2021
Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models
Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, John Grundy
Software defect prediction models are classifiers that are constructed from historical software data. Such software defect prediction models have been proposed to help developers optimize the limited Software Quality Assurance (SQA) resources and help managers develop SQA plans. Prior studies have different goals for their defect prediction models and use different techniques for generating visual explanations of their models. Yet, it is unclear what practitioners' perceptions are of (1) these defect prediction model goals, and (2) the model-agnostic techniques used to visualize these models. We conducted a qualitative survey to investigate practitioners' perceptions of the goals of defect prediction models and the model-agnostic techniques used to generate visual explanations of defect prediction models. We found that (1) 82%-84% of the respondents perceived that the three goals of defect prediction models are useful; (2) LIME is the most preferred technique for understanding the most important characteristics that contributed to a prediction of a file, while ANOVA/VarImp is the second most preferred technique for understanding the characteristics that are associated with software defects in the past. Our findings highlight the significance of investigating how to improve the understanding of defect prediction models and their predictions. Hence, model-agnostic techniques from the explainable AI domain may help practitioners to understand defect prediction models and their predictions.
SE Feb 21, 2021
Software Engineering for Internet of Things: The Practitioner's Perspective
Mahdi Fahmideh, Aakash Ahmed, Ali Behnaz et al.
Internet of Things based systems (IoT systems for short) are becoming increasingly popular across different industrial domains and their development is rapidly increasing to provide value-added services to end-users and citizens. Little research to date uncovers the core development process lifecycle needed for IoT systems, and thus software engineers find themselves unprepared and unfamiliar with this new genre of system development. To ameliorate this gap, we conducted a mixed quantitative and qualitative research study where we derived a conceptual process framework from the extant literature on IoT that identifies 27 key tasks for incorporating into development processes for IoT systems. The framework was then validated by means of a survey of 127 IoT systems practitioners and developers from 35 countries across 6 continents with 15 different industry backgrounds. Our research provides an understanding of the most important development process tasks and informs both software engineering practitioners and researchers of the challenges and recommendations related to the development of the next generation of IoT systems.
SE Feb 19, 2021
SQAPlanner: Generating Data-Informed Software Quality Improvement Plans
Dilini Rajapaksha, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee et al.
Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining maximum file size, to prevent the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights such as the most important factors that are associated with software quality. Such insights that are derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do or avoid to decrease the risk of having defects, and what the risk threshold is for each metric. A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. In this paper, we investigate the practitioners' perceptions of current SQA planning activities and the current challenges of such SQA planning activities, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-Driven SQAPlanner approach, a novel approach for generating four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate an information visualization for our SQAPlanner approach. Through the use of a qualitative survey and empirical evaluation, our results lead us to conclude that SQAPlanner is needed, effective, stable, and practically applicable. We also find that 80% of our survey respondents perceived that our visualization is more actionable. Thus, our SQAPlanner paves a way for novel research in actionable software analytics, i.e., generating actionable guidance on what practitioners should and should not do to decrease the risk of having defects to support SQA planning.
SE Jan 30, 2021
EdgeWorkflowReal: An Edge Computing based Workflow Execution Engine for Smart Systems
Xuejun Li, Ran Ding, Xiao Liu et al.
Current cloud-based smart systems suffer from weaknesses such as high response latency, limited network bandwidth and the restricted computing power of smart end devices, which seriously affect the system's QoS (Quality of Service). Recently, given its advantages of low latency, high bandwidth and location awareness, edge computing has become a promising solution for smart systems. However, the development of edge computing based smart systems is very challenging for software developers who do not have the skills to create edge computing environments. The management of edge computing resources and computing tasks is also very challenging. Workflow technology has been widely used in smart systems to automate task and resource management, but no real-world deployable edge computing based workflow execution engine yet exists. To fill this gap, we present EdgeWorkflowReal, an edge computing based workflow execution engine for smart systems. EdgeWorkflowReal supports: 1) automatic creation of a real edge computing environment according to user settings; 2) visualized modelling of edge workflow applications; and 3) automatic deployment, monitoring and performance evaluation of edge workflow applications in a smart system.
SE Dec 26, 2020
Requirements of API Documentation: A Case Study into Computer Vision Services
Alex Cummaudo, Rajesh Vasa, John Grundy et al.
Using cloud-based computer vision services is gaining traction, where developers access AI-powered components through familiar RESTful APIs, not needing to orchestrate large training and inference infrastructures or curate/label training datasets. However, while these APIs seem familiar to use, their non-deterministic run-time behaviour and evolution are not adequately communicated to developers. Therefore, improving these services' API documentation is paramount: more extensive documentation facilitates the development process of intelligent software. In a prior study, we extracted 34 API documentation artefacts from 21 seminal works, devising a taxonomy of five key requirements to produce quality API documentation. We extend this study in two ways. Firstly, we survey 104 developers of varying experience to understand which API documentation artefacts are of most value to practitioners. Secondly, we identify which of these highly-valued artefacts are or are not well-documented through a case study in the emerging computer vision service domain. We identify: (i) several gaps in the software engineering literature, where aspects of API documentation understanding are/are not extensively investigated; and (ii) where industry vendors (in contrast) document artefacts to better serve their end-developers. We provide a set of recommendations to enhance intelligent software documentation for both vendors and the wider research community.
SE Dec 18, 2020
A First Look at Human Values-Violation in App Reviews
Humphrey O. Obie, Waqar Hussain, Xin Xia et al.
Ubiquitous technologies such as mobile software applications (mobile apps) have a tremendous influence on the evolution of the social, cultural, economic, and political facets of life in society. Mobile apps fulfil many practical purposes for users including entertainment, transportation, financial management, etc. Given the ubiquity of mobile apps in the lives of individuals and the consequent effect of these technologies on society, it is essential to consider the relationship between human values and the development and deployment of mobile apps. The many negative consequences of technology violating human values such as privacy, fairness or social justice have been documented in recent times. If we can detect these violations in a timely manner, developers can look to better address them. To understand the violation of human values in a range of common mobile apps, we analysed 22,119 app reviews from Google Play Store using natural language processing techniques. We base our values violation detection approach on a widely accepted model of human values: the Schwartz theory of basic human values. The results of our analysis show that 26.5% of the reviews contained text indicating user-perceived violations of human values. We found that benevolence and self-direction were the most violated value categories, and conformity and tradition were the least violated categories. Our results also highlight the need for a proactive approach to the alignment of values amongst stakeholders and the use of app reviews as a valuable additional source for mining values requirements.
SE Dec 7, 2020
A Multi-dimensional Study of Requirements Changes in Agile Software Development Projects
Kashumi Madampe, Rashina Hoda, John Grundy
Agile processes are now widely practiced by software engineering (SE) teams, and the agile manifesto claims that agile methods support responding to changes well. However, no study appears to have researched whether this is accurate in reality. Requirements changes (RCs) are inevitable in any software development environment, and we wanted to acquire a holistic picture of how RCs occur and are handled in agile SE teams in practice. We also wanted to know whether responding to changes is the only or a main reason for software teams to use agile in their projects. To do this, we conducted a mixed-methods research study comprising interviews with 10 agile practitioners from New Zealand and Australia, a literature review, and an in-depth survey with the participation of 40 agile practitioners world-wide. Through this study we identified different types of RCs, their origination (including reasons for origination, forms, sources, carriers, and events at which they originate), their challenging nature, and finally whether agile helps to respond to changes or not. We also found that agile teams seem to be reluctant to accept RCs, and therefore use several mitigation strategies. Additionally, when they do accept RCs, they use a variety of techniques to handle them. Furthermore, we found that agile allowing better response to RCs is only a minor reason for practicing agile. Several more important reasons included being able to deliver the product in a shorter period and increasing team productivity. Practitioners stated that these improve the agile team environment and are thus the real motivators for teams to practice agile. Finally, we provide a set of practical recommendations that can be used to handle RCs more effectively in agile software development environments.
SE Dec 3, 2020
Explainable AI for Software Engineering
Chakkrit Tantithamthavorn, Jirayus Jiarpakdee, John Grundy
Artificial Intelligence/Machine Learning techniques have been widely used in software engineering to improve developer productivity, the quality of software systems, and decision-making. However, such AI/ML models for software engineering are still impractical, not explainable, and not actionable. These concerns often hinder the adoption of AI/ML models in software engineering practices. In this article, we first highlight the need for explainable AI in software engineering. Then, we summarize three successful case studies on how explainable AI techniques can be used to address the aforementioned challenges by making software defect prediction models more practical, explainable, and actionable.
SE Nov 30, 2020
A Survey on Deep Learning for Software Engineering
Yanming Yang, Xin Xia, David Lo et al.
In 2006, Geoffrey Hinton proposed the concept of training "Deep Neural Networks (DNNs)" and an improved model training method to break the bottleneck of neural network development. More recently, the introduction of AlphaGo in 2016 demonstrated the powerful learning ability of deep learning and its enormous potential. Deep learning has been increasingly used to develop state-of-the-art software engineering (SE) research tools due to its ability to boost performance for various SE tasks. There are many factors, e.g., deep learning model selection, internal structure differences, and model optimization techniques, that may have an impact on the performance of DNNs applied in SE. Few works to date focus on summarizing, classifying, and analyzing the application of deep learning techniques in SE. To fill this gap, we performed a survey to analyze the relevant studies published since 2006. We first provide an example to illustrate how deep learning techniques are used in SE. We then summarize and classify different deep learning techniques used in SE. We analyze key optimization technologies used in these deep learning models, and finally describe a range of key research topics using DNNs in SE. Based on our findings, we present a set of current challenges remaining to be investigated and outline a proposed research road map highlighting key opportunities for future work.
SE Nov 4, 2020
Opportunities and Challenges in Code Search Tools
Chao Liu, Xin Xia, David Lo et al.
Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve expected code from a large-scale codebase. However, there is a lack of a comprehensive comparative summary of existing code search approaches. To understand the research trends in existing code search studies, we systematically reviewed 81 relevant studies. We investigated the publication trends of code search studies, analyzed key components, such as the codebase, query, and modeling technique used to build code search tools, and classified existing tools according to the seven different search tasks they focus on supporting. Based on our findings, we identified a set of outstanding challenges in existing studies and a research roadmap for future code search research.
SE Sep 30, 2020
RCM: Requirement Capturing Model for Automated Requirements Formalisation
Aya Zaki-Ismail, Mohamed Osama, Mohamed Abdelrazek et al.
Most existing automated requirements formalisation techniques require system engineers to (re)write their requirements using a set of predefined requirement templates with a fixed structure and known semantics to simplify the formalisation process. However, these techniques require understanding and memorising requirement templates, which usually have a fixed format, limit the requirements that can be captured, and do not allow capture of more diverse requirements. To address these limitations, we need a reference model that captures key requirement details regardless of their structure, format or order. Then, using NLP techniques, we can transform textual requirements into the reference model. Finally, using a suite of transformation rules, we can convert these requirements into formal notations. In this paper, we introduce the first and key step in this process, a Requirement Capturing Model (RCM), as a reference model to capture the key elements of a system requirement regardless of their format or order. We evaluated the robustness of the RCM compared to 15 existing requirements representation approaches on a benchmark of 162 requirements. Our evaluation shows that RCM breakdowns support a wider range of requirements formats compared to the existing approaches. We also implemented a suite of transformation rules that transforms RCM-based requirements into temporal logic(s). In the future, we will develop an NLP-based RCM extraction technique to provide an end-to-end solution.
SE Aug 19, 2020
Threshy: Supporting Safe Usage of Intelligent Web Services
Alex Cummaudo, Scott Barnett, Rajesh Vasa et al.
The increased popularity of 'intelligent' web services provides end-users with machine-learnt functionality at little effort to developers. However, these services require a decision threshold to be set, which is dependent on problem-specific data. Developers lack a systematic approach for evaluating intelligent services, and existing evaluation tools are predominantly targeted at data scientists for pre-development evaluation. This paper presents a workflow and supporting tool, Threshy, to help software developers select a decision threshold suited to their problem domain. Unlike existing tools, Threshy is designed to operate in multiple workflows including pre-development, pre-release, and support. Threshy is designed for tuning the confidence scores returned by intelligent web services and does not deal with hyper-parameter optimisation used in ML models. Additionally, it considers the financial impacts of false positives. Threshold configuration files exported by Threshy can be integrated into client applications and monitoring infrastructure. Demo: https://bit.ly/2YKeYhE.
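To make the idea of a cost-aware decision threshold concrete, the following is a minimal Python sketch. It is not Threshy's implementation or API: the function `pick_threshold`, the `ThresholdResult` type, the cost parameters, and the benchmark data are all hypothetical, and the sketch assumes a small labelled benchmark of (confidence score, true label) pairs is available. It tries each observed confidence score as a candidate threshold and keeps the one with the lowest total cost of false positives and false negatives.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ThresholdResult:
    """Outcome of evaluating one candidate threshold (hypothetical type)."""
    threshold: float
    false_positives: int
    false_negatives: int
    cost: float


def pick_threshold(
    scored: Sequence[Tuple[float, bool]],
    fp_cost: float,
    fn_cost: float,
) -> ThresholdResult:
    """Return the candidate threshold with the lowest total cost.

    `scored` holds (confidence, is_actually_positive) pairs from a labelled
    benchmark. A prediction is accepted when confidence >= threshold.
    Ties are broken in favour of the lowest threshold.
    """
    if not scored:
        raise ValueError("need at least one labelled example")

    # Candidates: every distinct observed score, plus "accept nothing".
    candidates: List[float] = sorted({conf for conf, _ in scored})
    candidates.append(float("inf"))

    results = []
    for t in candidates:
        fp = sum(1 for conf, positive in scored if conf >= t and not positive)
        fn = sum(1 for conf, positive in scored if conf < t and positive)
        results.append(ThresholdResult(t, fp, fn, fp * fp_cost + fn * fn_cost))

    # min() keeps the first minimum, i.e. the lowest threshold among ties.
    return min(results, key=lambda r: r.cost)


# Made-up benchmark: (confidence returned by the service, ground truth).
benchmark = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, False),
]

# A false positive is assumed to cost five times as much as a false negative.
strict = pick_threshold(benchmark, fp_cost=5.0, fn_cost=1.0)

# The reverse assumption pushes the chosen threshold down.
lenient = pick_threshold(benchmark, fp_cost=1.0, fn_cost=5.0)
```

With these made-up numbers, weighting false positives more heavily selects a higher threshold (0.90) than weighting false negatives more heavily (0.70), which illustrates why the threshold depends on problem-specific data and costs rather than on the service alone.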