Xavier Franch

SE
h-index37
43papers
1,116citations
Novelty20%
AI Score45

43 Papers

LGFeb 2, 2023
Energy Efficiency of Training Neural Network Architectures: An Empirical Study

Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez et al.

The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high computational power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact in terms of energy consumed and CO$_2$ emissions produced during training by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture and the location where the computations are hosted on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the difference on the method of measurement of the energy consumed using software-based and hardware-based tools.

SEJul 19, 2023
Towards green AI-based software systems: an architecture-centric approach (GAISSA)

Silverio Martínez-Fernández, Xavier Franch, Francisco Durán

Nowadays, AI-based systems have achieved outstanding results and have outperformed humans in different domains. However, the processes of training AI models and inferring from them require high computational resources, which pose a significant challenge in the current energy efficiency societal demand. To cope with this challenge, this research project paper describes the main vision, goals, and expected outcomes of the GAISSA project. The GAISSA project aims at providing data scientists and software engineers tool-supported, architecture-centric methods for the modelling and development of green AI-based systems. Although the project is in an initial stage, we describe the current research results, which illustrate the potential to achieve GAISSA objectives.

LGJul 7, 2023
Estimating Deep Learning energy consumption based on model architecture and training environment

Santiago del Rey, Luís Cruz, Xavier Franch et al.

To raise awareness of the environmental impact of deep learning (DL), many studies estimate the energy use of DL systems. However, energy estimates during DL training often rely on unverified assumptions. This work addresses that gap by investigating how model architecture and training environment affect energy consumption. We train a variety of computer vision models and collect energy consumption and accuracy metrics to analyze their trade-offs across configurations. Our results show that selecting the right model-training environment combination can reduce training energy consumption by up to 80.68% with less than 2% loss in $F_1$ score. We find a significant interaction effect between model and training environment: energy efficiency improves when GPU computational power scales with model complexity. Moreover, we demonstrate that common estimation practices, such as using FLOPs or GPU TDP, fail to capture these dynamics and can lead to substantial errors. To address these shortcomings, we propose the Stable Training Epoch Projection (STEP) and the Pre-training Regression-based Estimation (PRE) methods. Across evaluations, our methods outperform existing tools by a factor of two or more in estimation accuracy.

CLAug 2, 2024
Leveraging Encoder-only Large Language Models for Mobile App Review Feature Extraction

Quim Motger, Alessio Miaschi, Felice Dell'Orletta et al.

Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that this method improves the precision and recall of extracted features and enhances performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.

LGJul 3, 2024
The More the Merrier? Navigating Accuracy vs. Energy Efficiency Design Trade-Offs in Ensemble Learning Systems

Rafiullah Omar, Justus Bogner, Henry Muccini et al.

Background: Machine learning (ML) model composition is a popular technique to mitigate shortcomings of a single ML model and to design more effective ML-enabled systems. While ensemble learning, i.e., forwarding the same request to several models and fusing their predictions, has been studied extensively for accuracy, we have insufficient knowledge about how to design energy-efficient ensembles. Objective: We therefore analyzed three types of design decisions for ensemble learning regarding a potential trade-off between accuracy and energy consumption: a) ensemble size, i.e., the number of models in the ensemble, b) fusion methods (majority voting vs. a meta-model), and c) partitioning methods (whole-dataset vs. subset-based training). Methods: By combining four popular ML algorithms for classification in different ensembles, we conducted a full factorial experiment with 11 ensembles x 4 datasets x 2 fusion methods x 2 partitioning methods (176 combinations). For each combination, we measured accuracy (F1-score) and energy consumption in J (for both training and inference). Results: While a larger ensemble size significantly increased energy consumption (size 2 ensembles consumed 37.49% less energy than size 3 ensembles, which in turn consumed 26.96% less energy than the size 4 ensembles), it did not significantly increase accuracy. Furthermore, majority voting outperformed meta-model fusion both in terms of accuracy (Cohen's d of 0.38) and energy consumption (Cohen's d of 0.92). Lastly, subset-based training led to significantly lower energy consumption (Cohen's d of 0.91), while training on the whole dataset did not increase accuracy significantly. Conclusions: From a Green AI perspective, we recommend designing ensembles of small size (2 or maximum 3 models), using subset-based training, majority voting, and energy-efficient ML algorithms like decision trees, Naive Bayes, or KNN.

SEJul 8, 2022
Guiding the retraining of convolutional neural networks against adversarial inputs

Francisco Durán López, Silverio Martínez-Fernández, Michael Felderer et al.

Background: When using deep learning models, there are many possible vulnerabilities and some of the most worrying are the adversarial inputs, which can cause wrong decisions with minor perturbations. Therefore, it becomes necessary to retrain these models against adversarial inputs, as part of the software testing process addressing the vulnerability to these inputs. Furthermore, for an energy efficient testing and retraining, data scientists need support on which are the best guidance metrics and optimal dataset configurations. Aims: We examined four guidance metrics for retraining convolutional neural networks and three retraining configurations. Our goal is to improve the models against adversarial inputs regarding accuracy, resource utilization and time from the point of view of a data scientist in the context of image classification. Method: We conducted an empirical study in two datasets for image classification. We explore: (a) the accuracy, resource utilization and time of retraining convolutional neural networks by ordering new training set by four different guidance metrics (neuron coverage, likelihood-based surprise adequacy, distance-based surprise adequacy and random), (b) the accuracy and resource utilization of retraining convolutional neural networks with three different configurations (from scratch and augmented dataset, using weights and augmented dataset, and using weights and only adversarial inputs). Results: We reveal that retraining with adversarial inputs from original weights and by ordering with surprise adequacy metrics gives the best model w.r.t. the used metrics. Conclusions: Although more studies are necessary, we recommend data scientists to use the above configuration and metrics to deal with the vulnerability to adversarial inputs of deep learning models, as they can improve their models against adversarial inputs without using many inputs.

SENov 22, 2023
Analyzing the Evolution and Maintenance of ML Models on Hugging Face

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch et al.

Hugging Face (HF) has established itself as a crucial platform for the development and sharing of machine learning (ML) models. This repository mining study, which delves into more than 380,000 models using data gathered via the HF Hub API, aims to explore the community engagement, evolution, and maintenance around models hosted on HF, aspects that have yet to be comprehensively explored in the literature. We first examine the overall growth and popularity of HF, uncovering trends in ML domains, framework usage, authors grouping and the evolution of tags and datasets used. Through text analysis of model card descriptions, we also seek to identify prevalent themes and insights within the developer community. Our investigation further extends to the maintenance aspects of models, where we evaluate the maintenance status of ML models, classify commit messages into various categories (corrective, perfective, and adaptive), analyze the evolution across development stages of commits metrics and introduce a new classification system that estimates the maintenance status of models based on multiple attributes. This study aims to provide valuable insights about ML model maintenance and evolution that could inform future model development strategies on platforms like HF.

17.9SEMar 26
Opportunities and Limitations of GenAI in RE: Viewpoints from Practice

Anne Hess, Andreas Vogelsang, Xavier Franch et al.

Context and motivation: With the rapid advancement of AI technologies, there is an increasing need to understand how AI can be effectively integrated into RE processes. In recent years, several studies have explored the potential and challenges of applying GenAI to support or even automate RE-related activities. Question/problem: Despite the existing body of knowledge on AI's potential for supporting RE activities, there is limited evidence on its practical applicability and limitations from an industry perspective. Principal ideas/results: To address this gap, we conducted a survey with RE practitioners in collaboration with the IREB Special Interest Group on AI & RE. In addition to describing our research methodology and survey design, we present insights from our quantitative and qualitative data analyzes. These insights include practitioners' perspectives on current usage scenarios, concerns, experiences-both positive and negative-as well as training needs related to using GenAI in requirements elicitation, analysis, specification, validation, and management. Contribution: This study provides empirical evidence on the practical use of GenAI in RE, offering insights into its benefits, challenges, and training needs. The findings inform future research and industry strategies, guiding effective AI integration and skill development for improved RE processes and results.

20.7SEApr 15
Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study

Quim Motger, Carlota Catot, Xavier Franch

Large Language Models (LLMs) depend on high-quality, domain-specific natural language datasets. This dependency is particularly pronounced in Requirements Engineering (RE), where core activities rely on textual artifacts such as requirements, specifications, and stakeholder feedback. Despite the increasing use of LLMs in RE, data scarcity remains a widely reported limitation. While several datasets support LLM-based RE research, they are scattered across studies and lack systematic characterization, hindering reuse, comparability and assessment. This paper addresses this gap by examining which public datasets are used in LLM-based RE, how they can be consistently characterized, and which RE tasks and dataset properties remain under-represented. We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language, among others. The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity. The resulting catalogue and characterisation scheme support informed dataset selection, comparison, and reuse, contributing to stronger empirical foundations for LLM-based RE research and evaluation.

LGSep 19, 2024
Impact of ML Optimization Tactics on Greener Pre-Trained ML Models

Alexandra González Álvarez, Joel Castaño, Xavier Franch et al.

Background: Given the fast-paced nature of today's technology, which has surpassed human performance in tasks like image classification, visual reasoning, and English understanding, assessing the impact of Machine Learning (ML) on energy consumption is crucial. Traditionally, ML projects have prioritized accuracy over energy, creating a gap in energy consumption during model inference. Aims: This study aims to (i) analyze image classification datasets and pre-trained models, (ii) improve inference efficiency by comparing optimized and non-optimized models, and (iii) assess the economic impact of the optimizations. Method: We conduct a controlled experiment to evaluate the impact of various PyTorch optimization techniques (dynamic quantization, torch.compile, local pruning, and global pruning) to 42 Hugging Face models for image classification. The metrics examined include GPU utilization, power and energy consumption, accuracy, time, computational complexity, and economic costs. The models are repeatedly evaluated to quantify the effects of these software engineering tactics. Results: Dynamic quantization demonstrates significant reductions in inference time and energy consumption, making it highly suitable for large-scale systems. Additionally, torch.compile balances accuracy and energy. In contrast, local pruning shows no positive impact on performance, and global pruning's longer optimization times significantly impact costs. Conclusions: This study highlights the role of software engineering tactics in achieving greener ML models, offering guidelines for practitioners to make informed decisions on optimization methods that align with sustainability goals.

SEJun 20, 2025Code
Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems

Matias Martinez, Xavier Franch

The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.

SEJun 3, 2025Code
How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face

Alexandra González, Xavier Franch, David Lo et al.

Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card descriptions and metadata, and the abstract of the associated arXiv papers. We confirmed SE relevance through multiple filtering steps: detecting outliers, identifying near-identical PTMs, and the use of Gemini 2.0 Flash, which was validated with five pilot studies involving three human annotators. This approach uncovered 2,205 SE PTMs. We find that code generation is the most common SE task among PTMs, primarily focusing on software implementation, while requirements engineering and software design activities receive limited attention. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2. Our classification provides a solid foundation for future automated SE scenarios, such as the sampling and selection of suitable PTMs.

SENov 14, 2024Code
Towards a Classification of Open-Source ML Models and Datasets for Software Engineering

Alexandra González, Xavier Franch, David Lo et al.

Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.

SEFeb 11, 2024
Lessons Learned from Mining the Hugging Face Repository

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch

The rapidly evolving fields of Machine Learning (ML) and Artificial Intelligence have witnessed the emergence of platforms like Hugging Face (HF) as central hubs for model development and sharing. This experience report synthesizes insights from two comprehensive studies conducted on HF, focusing on carbon emissions and the evolutionary and maintenance aspects of ML models. Our objective is to provide a practical guide for future researchers embarking on mining software repository studies within the HF ecosystem to enhance the quality of these studies. We delve into the intricacies of the replication package used in our studies, highlighting the pivotal tools and methodologies that facilitated our analysis. Furthermore, we propose a nuanced stratified sampling strategy tailored for the diverse HF Hub dataset, ensuring a representative and comprehensive analytical approach. The report also introduces preliminary guidelines, transitioning from repository mining to cohort studies, to establish causality in repository mining studies, particularly within the ML model of HF context. This transition is inspired by existing frameworks and is adapted to suit the unique characteristics of the HF model ecosystem. Our report serves as a guiding framework for researchers, contributing to the responsible and sustainable advancement of ML, and fostering a deeper understanding of the broader implications of ML models.

SEMay 1, 2025
Aggregating empirical evidence from data strategy studies: a case on model quantization

Santiago del Rey, Paulo Sérgio Medeiros dos Santos, Guilherme Horta Travassos et al.

Background: As empirical software engineering evolves, more studies adopt data strategies$-$approaches that investigate digital artifacts such as models, source code, or system logs rather than relying on human subjects. Synthesizing results from such studies introduces new methodological challenges. Aims: This study assesses the effects of model quantization on correctness and resource efficiency in deep learning (DL) systems. Additionally, it explores the methodological implications of aggregating evidence from empirical studies that adopt data strategies. Method: We conducted a research synthesis of six primary studies that empirically evaluate model quantization. We applied the Structured Synthesis Method (SSM) to aggregate the findings, which combines qualitative and quantitative evidence through diagrammatic modeling. A total of 19 evidence models were extracted and aggregated. Results: The aggregated evidence indicates that model quantization weakly negatively affects correctness metrics while consistently improving resource efficiency metrics, including storage size, inference latency, and GPU energy consumption$-$a manageable trade-off for many DL deployment contexts. Evidence across quantization techniques remains fragmented, underscoring the need for more focused empirical studies per technique. Conclusions: Model quantization offers substantial efficiency benefits with minor trade-offs in correctness, making it a suitable optimization strategy for resource-constrained environments. This study also demonstrates the feasibility of using SSM to synthesize findings from data strategy-based research.

SEJan 14, 2025
Addressing Quality Challenges in Deep Learning: The Role of MLOps and Domain Knowledge

Santiago del Rey, Adrià Medina, Xavier Franch et al.

Deep learning (DL) systems present unique challenges in software engineering, especially concerning quality attributes like correctness and resource efficiency. While DL models excel in specific tasks, engineering DL systems is still essential. The effort, cost, and potential diminishing returns of continual improvements must be carefully evaluated, as software engineers often face the critical decision of when to stop refining a system relative to its quality attributes. This experience paper explores the role of MLOps practices -- such as monitoring and experiment tracking -- in creating transparent and reproducible experimentation environments that enable teams to assess and justify the impact of design decisions on quality attributes. Furthermore, we report on experiences addressing the quality challenges by embedding domain knowledge into the design of a DL model and its integration within a larger system. The findings offer actionable insights into the benefits of domain knowledge and MLOps and the strategic consideration of when to limit further optimizations in DL projects to maximize overall system quality and reliability.

SEOct 24, 2025
Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dabrowski et al.

[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score of 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLMs performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes with evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.

LGMay 18, 2023
Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch et al.

The rise of machine learning (ML) systems has exacerbated their carbon footprint due to increased capabilities and model sizes. However, there is scarce knowledge on how the carbon footprint of ML models is actually measured, reported, and evaluated. In light of this, the paper aims to analyze the measurement of the carbon footprint of 1,417 ML models and associated datasets on Hugging Face, which is the most popular repository for pretrained ML models. The goal is to provide insights and recommendations on how to report and optimize the carbon efficiency of ML models. The study includes the first repository mining study on the Hugging Face Hub API on carbon emissions. This study seeks to answer two research questions: (1) how do ML model creators measure and report carbon emissions on Hugging Face Hub?, and (2) what aspects impact the carbon emissions of training ML models? The study yielded several key findings. These include a stalled proportion of carbon emissions-reporting models, a slight decrease in reported carbon footprint on Hugging Face over the past 2 years, and a continued dominance of NLP as the main application domain. Furthermore, the study uncovers correlations between carbon emissions and various attributes such as model size, dataset size, and ML application domains. These results highlight the need for software measurements to improve energy reporting practices and promote carbon-efficient model development within the Hugging Face community. In response to this issue, two classifications are proposed: one for categorizing models based on their carbon emission reporting practices and another for their carbon efficiency. The aim of these classification proposals is to foster transparency and sustainable model development within the ML community.

SEOct 18, 2021
Use and Misuse of the Term Experiment in Mining Software Repositories Research

Claudia Ayala, Burak Turhan, Xavier Franch et al.

The significant momentum and importance of Mining Software Repositories (MSR) in Software Engineering (SE) has fostered new opportunities and challenges for extensive empirical research. However, MSR researchers seem to struggle to characterize the empirical methods they use into the existing empirical SE body of knowledge. This is especially the case of MSR experiments. To provide evidence on the special characteristics of MSR experiments and their differences with experiments traditionally acknowledged in SE so far, we elicited the hallmarks that differentiate an experiment from other types of empirical studies and characterized the hallmarks and types of experiments in MSR. We analyzed MSR literature obtained from a small-scale systematic mapping study to assess the use of the term experiment in MSR. We found that 19% of the papers claiming to be an experiment are indeed not an experiment at all but also observational studies, so they use the term in a misleading way. From the remaining 81% of the papers, only one of them refers to a genuine controlled experiment while the others stand for experiments with limited control. MSR researchers tend to overlook such limitations, compromising the interpretation of the results of their studies. We provide recommendations and insights to support the improvement of MSR experiments.

SEOct 7, 2021
How Tertiary Studies perform Quality Assessment of Secondary Studies in Software Engineering

Dolors Costal, Carles Farré, Xavier Franch et al.

Context: Tertiary studies are becoming increasingly popular in software engineering as an instrument to synthesise evidence on a research topic in a systematic way. In order to understand and contextualize their findings, it is important to assess the quality of the selected secondary studies. Objective: This paper aims to provide a state of the art on the assessment of secondary studies' quality as conducted in tertiary studies in the area of software engineering, reporting the frameworks used as instruments, the facets examined in these frameworks, and the purposes of the quality assessment. Method: We designed this study as a systematic mapping responding to four research questions derived from the objective above. We applied a rigorous search protocol over the Scopus digital library, resulting in 47 papers after application of inclusion and exclusion criteria. The extracted data was synthesised using content analysis. Results: A majority of tertiary studies perform quality assessment. It is not often used for excluding studies, but to support some kind of investigation. The DARE quality assessment framework is the most frequently used, with customizations in some cases to cover missing facets. We outline the first steps towards building a new framework to address the shortcomings identified. Conclusion: This paper is a step forward establishing a foundation for researchers in two different ways. As authors of tertiary studies, understanding the different possibilities in which they can perform quality assessment of secondary studies. As readers, having an instrument to understand the methodological rigor upon which tertiary studies may claim their findings.

LGSep 28, 2021
Which Design Decisions in AI-enabled Mobile Applications Contribute to Greener AI?

Roger Creus Castanyer, Silverio Martínez-Fernández, Xavier Franch

Background: The construction, evolution and usage of complex artificial intelligence (AI) models demand expensive computational resources. While currently available high-performance computing environments support well this complexity, the deployment of AI models in mobile devices, which is an increasing trend, is challenging. Mobile applications consist of environments with low computational resources and hence imply limitations in the design decisions during the AI-enabled software engineering lifecycle that balance the trade-off between the accuracy and the complexity of the mobile applications. Objective: Our objective is to systematically assess the trade-off between accuracy and complexity when deploying complex AI models (e.g. neural networks) to mobile devices, which have an implicit resource limitation. We aim to cover (i) the impact of the design decisions on the achievement of high-accuracy and low resource-consumption implementations; and (ii) the validation of profiling tools for systematically promoting greener AI. Method: This confirmatory registered report consists of a plan to conduct an empirical study to quantify the implications of the design decisions on AI-enabled applications performance and to report experiences of the end-to-end AI-enabled software engineering lifecycle. Concretely, we will implement both image-based and language-based neural networks in mobile applications to solve multiple image classification and text classification problems on different benchmark datasets. Overall, we plan to model the accuracy and complexity of AI-enabled applications in operation with respect to their design decisions and will provide tools for allowing practitioners to gain consciousness of the quantitative relationship between the design decisions and the green characteristics of study.

SESep 16, 2021
Inclusion and Exclusion Criteria in Software Engineering Tertiary Studies: A Systematic Mapping and Emerging Framework

Dolors Costal, Carles Farré, Xavier Franch et al.

Context: Tertiary studies in software engineering (TS@SE) are widely used to synthesise evidence on a research topic systematically. As part of their protocol, TS@SE define inclusion and exclusion criteria (IC/EC) aimed at selecting those secondary studies (SS) to be included in the analysis. Aims: To provide a state of the art on the definition and application of IC/EC in TS@SE, and from the results of this analysis, we outline an emerging framework, TSICEC, to be used by SE researchers. Method: To provide the state of the art, we conducted a systematic mapping (SM) combining automatic search and snowballing over the body of SE scientific literature, which led to 50 papers after application of our own IC/EC. The extracted data was synthesised using content analysis. The results were used to define a first version of TSICEC. Results: The SM resulted in a coding schema, and a thorough analysis of the selected papers on the basis of this coding. Our TSICEC framework includes guidelines for the definition of IC/EC in TS@SE. Conclusion: This paper is a step forward establishing a foundation for researchers in two ways. As authors, understanding the different possibilities to define IC/EC and apply them to select SS. As readers, having an instrument to understand the methodological rigor upon which TS@SE may claim their findings.

CLJun 21, 2021
Software-Based Dialogue Systems: Survey, Taxonomy and Challenges

Quim Motger, Xavier Franch, Jordi Marco

The use of natural language interfaces in the field of human-computer interaction is undergoing intense study through dedicated scientific and industrial research. The latest contributions in the field, including deep learning approaches like recurrent neural networks, the potential of context-aware strategies and user-centred design approaches, have brought back the attention of the community to software-based dialogue systems, generally known as conversational agents or chatbots. Nonetheless, and given the novelty of the field, a generic, context-independent overview on the current state of research of conversational agents covering all research perspectives involved is missing. Motivated by this context, this paper reports a survey of the current state of research of conversational agents through a systematic literature review of secondary studies. The conducted research is designed to develop an exhaustive perspective through a clear presentation of the aggregated knowledge published by recent literature within a variety of domains, research focuses and contexts. As a result, this research proposes a holistic taxonomy of the different dimensions involved in the conversational agents' field, which is expected to help researchers and to lay the groundwork for future research in the field of natural language interfaces.

SEMay 28, 2021
A Study about the Knowledge and Use of Requirements Engineering Standards in Industry

Xavier Franch, Martin Glinz, Daniel Mendez et al.

Context: The use of standards is considered a vital part of any engineering discipline. So one could expect that standards play an important role in Requirements Engineering (RE) as well. However, little is known about the actual knowledge and use of RE-related standards in industry. Objective: In this article, we investigate to which extent standards and related artifacts such as templates or guidelines are known and used by RE practitioners. Method: To this end, we have conducted a questionnaire-based online survey. We could analyze the replies from 90 RE practitioners using a combination of closed and open-text questions. Results: Our results indicate that the knowledge and use of standards and related artifacts in RE is less widespread than one might expect from an engineering perspective. For example, about 47% of the respondents working as requirements engineers or business analysts do not know the core standard in RE, ISO/IEC/IEEE 29148. Participants in our study mostly use standards by personal decision rather than being imposed by their respective company, customer, or regulator. Beyond insufficient knowledge, we also found cultural and organizational factors impeding the widespread adoption of standards in RE. Conclusions: Overall, our results provide empirically informed insights into the actual use of standards and related artifacts in RE practice and - indirectly - about the value that the current standards create for RE practitioners.

SEMay 5, 2021
Software Engineering for AI-Based Systems: A Survey

Silverio Martínez-Fernández, Justus Bogner, Xavier Franch et al.

AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image- and speech-recognition, and autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems. To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study. We considered 248 studies published between January 2010 and March 2020. SE for AI-based systems is an emerging research area, where more than 2/3 of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges. Our results are valuable for: researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and, educators, to bridge the gap among SE and AI in their curricula.

SEMar 19, 2021
Improving Web API Usage Logging

Rediana Koçi, Xavier Franch, Petar Jovanovic et al.

A Web API (WAPI) is a type of API whose interaction with its consumers is done through the Internet. While being accessed through the Internet can be challenging, mostly when WAPIs evolve, it gives providers the possibility to monitor their usage, and understand and analyze consumers' behavior. Currently, WAPI usage is mostly logged for traffic monitoring and troubleshooting. Even though they contain invaluable information regarding consumers' behavior} they are not sufficiently used by providers. In this paper, we first consider two phases of the application development lifecycle, and based on them we distinguish two different types of usage logs, namely development logs and production logs. For each of them we show the potential analyses (e.g., WAPI usability evaluation, consumers' needs identification) that can be performed, as well as the main impediments, that may be caused by the unsuitable log format. We then conduct a case study using logs of the same WAPI from different deployments and different formats, to demonstrate the occurrence of these impediments and at the same time the importance of a proper log format. Next, based on the case study results, we present the main quality issues of WAPI log data and explain their impact on data analyses. For each of them, we give some practical suggestions on how to deal with them, as well as mitigating their root cause.

LGMar 11, 2021
Integration of Convolutional Neural Networks in Mobile Applications

Roger Creus Castanyer, Silverio Martínez-Fernández, Xavier Franch

When building Deep Learning (DL) models, data scientists and software engineers manage the trade-off between their accuracy, or any other suitable success criteria, and their complexity. In an environment with high computational power, a common practice is making the models go deeper by designing more sophisticated architectures. However, in the context of mobile devices, which possess less computational power, keeping complexity under control is a must. In this paper, we study the performance of a system that integrates a DL model as a trade-off between the accuracy and the complexity. At the same time, we relate the complexity to the efficiency of the system. With this, we present a practical study that aims to explore the challenges met when optimizing the performance of DL models becomes a requirement. Concretely, we aim to identify: (i) the most concerning challenges when deploying DL-based software in mobile applications; and (ii) the path for optimizing the performance trade-off. We obtain results that verify many of the identified challenges in the related work such as the availability of frameworks and the software-data dependency. We provide a documentation of our experience when facing the identified challenges together with the discussion of possible solutions to them. Additionally, we implement a solution to the sustainability of the DL models when deployed in order to reduce the severity of other identified challenges. Moreover, we relate the performance trade-off to a new defined challenge featuring the impact of the complexity in the obtained accuracy. Finally, we discuss and motivate future work that aims to provide solutions to the more open challenges found.

SEFeb 23, 2021
The State-of-Practice in Requirements Elicitation: An Extended Interview Study at 12 Companies

Cristina Palomares, Xavier Franch, Carme Quer et al.

Context. Requirements engineering remains a discipline that is faced with a large number of challenges, including the implementation of a requirements elicitation process in industry. Although several proposals have been suggested by researchers and academics, little is known of the practices that are actually followed in industry. Objective. We investigate the SoTA with respect to requirements elicitation, examining practitioners' practices. We focus on the techniques, the roles involved, and the challenges associated to the process. Method. We conducted an interview-based survey study involving 24 practitioners from 12 different Swedish IT companies. Results. We found that group interaction techniques, including meetings and workshops, are the most popular type of elicitation techniques that are employed by the practitioners, except in the case of small projects. We noted that customers are frequently involved in the elicitation process, except in the case of market-driven organizations. Technical staff (for example, developers and architects) are more frequently involved in the elicitation process compared to the involvement of business- or strategic staff. Finally, we identified a number of challenges with respect to stakeholders. These challenges include difficulties in understanding and prioritizing their needs. Further, it was noted that requirements instability (i.e., caused by changing needs or priorities) was a predominant challenge. These observations need to be interpreted in the context of the study. Conclusion. The relevant observations regarding the survey participants' experiences should be of interest to the industry; experiences that should be analyzed in the practitioners' context. Researchers may find evidence for the use of academic results in practice, thereby inspiring future theoretical work, as well as further empirical studies in the same area.

SEFeb 16, 2021
Improved management of issue dependencies in issue trackers of large collaborative projects

Mikko Raatikainen, Quim Motger, Clara Marie Lüders et al.

Issue trackers, such as Jira, have become the prevalent collaborative tools in software engineering for managing issues, such as requirements, development tasks, and software bugs. However, issue trackers inherently focus on the lifecycle of single issues, although issues have and express dependencies on other issues that constitute issue dependency networks in large complex collaborative projects. The objective of this study is to develop supportive solutions for the improved management of dependent issues in an issue tracker. This study follows the Design Science methodology, consisting of eliciting drawbacks and constructing and evaluating a solution and system. The study was carried out in the context of The Qt Company's Jira, which exemplifies an actively used, almost two-decade-old issue tracker with over 100,000 issues. The drawbacks capture how users operate with issue trackers to handle issue information in large, collaborative, and long-lived projects. The basis of the solution is to keep issues and dependencies as separate objects and automatically construct an issue graph. Dependency detections complement the issue graph by proposing missing dependencies, while consistency checks and diagnoses identify conflicting issue priorities and release assignments. Jira's plugin and service-based system architecture realize the functional and quality concerns of the system implementation. We show how to adopt the intelligent supporting techniques of an issue tracker in a complex use context and a large data-set. The solution considers an integrated and holistic system view, practical applicability and utility, and the practical characteristics of issue data, such as inherent incompleteness.

SEFeb 11, 2021
QFL: Data-Driven Feedback Loop to Manage Quality in Agile Development

Lidia López, Alessandra Bagnato, Antonin Ahbervé et al.

Background: Quality requirements (QRs) describe desired system qualities, playing an important role in the success of software projects. In the context of agile software development (ASD), where the main objective is the fast delivery of functionalities, QRs are often ill-defined and not well addressed during the development process. Software analytics tools help to control quality though the measurement of quality-related software aspects to support decision-makers in the process of QR management. Aim: The goal of this research is to explore the benefits of integrating a concrete software analytics tool, Q-Rapids Tool, to assess software quality and support QR management processes. Method: In the context of a technology transfer project, the Softeam company has integrated Q-Rapids Tool in their development process. We conducted a series of workshops involving Softeam members working in the Modelio product development. Results: We present the Quality Feedback Loop (QFL) process to be integrated in software development processes to control the complete QR life-cycle, from elicitation to validation. As a result of the implementation of QFL in Softeam, Modelio's team members highlight the benefits of integrating a data analytics tool with their project planning tool and the fact that project managers can control the whole process making the final decisions. Conclusions: Practitioners can benefit from the integration of software analytics tools as part of their software development toolchain to control software quality. The implementation of QFL promotes quality in the organization and the integration of software analytics and project planning tools also improves the communication between teams.

SENov 10, 2020
How do Practitioners Perceive the Relevance of Requirements Engineering Research?

Xavier Franch, Daniel Mendez, Andreas Vogelsang et al.

The relevance of Requirements Engineering (RE) research to practitioners is vital for a long-term dissemination of research results to everyday practice. Some authors have speculated about a mismatch between research and practice in the RE discipline. However, there is not much evidence to support or refute this perception. This paper presents the results of a study aimed at gathering evidence from practitioners about their perception of the relevance of RE research and at understanding the factors that influence that perception. We conducted a questionnaire-based survey of industry practitioners with expertise in RE. The participants rated the perceived relevance of 435 scientific papers presented at five top RE-related conferences. The 153 participants provided a total of 2,164 ratings. The practitioners rated RE research as essential or worthwhile in a majority of cases. However, the percentage of non-positive ratings is still higher than we would like. Among the factors that affect the perception of relevance are the research's links to industry, the research method used, and respondents' roles. The reasons for positive perceptions were primarily related to the relevance of the problem and the soundness of the solution, while the causes for negative perceptions were more varied. The respondents also provided suggestions for future research, including topics researchers have studied for decades, like elicitation or requirement quality criteria.

SEMar 11, 2020
Developing and Operating Artificial Intelligence Models in Trustworthy Autonomous Systems

Silverio Martínez-Fernández, Xavier Franch, Andreas Jedlitschka et al.

Companies dealing with Artificial Intelligence (AI) models in Autonomous Systems (AS) face several problems, such as users' lack of trust in adverse or unknown conditions, gaps between software engineering and AI model development, and operation in a continuously changing operational environment. This work-in-progress paper aims to close the gap between the development and operation of trustworthy AI-based AS by defining an approach that coordinates both activities. We synthesize the main challenges of AI-based AS in industrial settings. We reflect on the research efforts required to overcome these challenges and propose a novel, holistic DevOps approach to put it into practice. We elaborate on four research directions: (a) increased users' trust by monitoring operational AI-based AS and identifying self-adaptation needs in critical situations; (b) integrated agile process for the development and evolution of AI models and AS; (c) continuous deployment of different context-specific instances of AI models in a distributed setting of AS; and (d) holistic DevOps-based lifecycle for AI-based AS.

SEFeb 6, 2020
Management of quality requirements in agile and rapid software development: A systematic mapping study

Woubshet Behutiye, Pertti Karhapää, Lidia Lopez et al.

Context:Quality requirements (QRs) describe the desired quality of software, and they play an important role in the success of software projects. In agile software development (ASD), QRs are often ill-defined and not well addressed due to the focus on quickly delivering functionality. Rapid software development (RSD) approaches (e.g., continuous delivery and continuous deployment), which shorten delivery times, are more prone to neglect QRs. Despite the significance of QRs in both ASD and RSD, there is limited synthesized knowledge on their management in those approaches. Objective:This study aims to synthesize state-of-the-art knowledge about QR management in ASD and RSD, focusing on three aspects: bibliometric, strategies, and challenges. Research method:Using a systematic mapping study with a snowballing search strategy, we identified and structured the literature on QR management in ASD and RSD. Check the PDF file to see the full abstract and document.

SEFeb 1, 2019
Do We Preach What We Practice? Investigating the Practical Relevance of Requirements Engineering Syllabi - The IREB Case

Daniel Méndez Fernández, Xavier Franch, Norbert Seyff et al.

Nowadays, there exist a plethora of different educational syllabi for Requirements Engineering (RE), all aiming at incorporating practically relevant educational units (EUs). Many of these syllabi are based, in one way or the other, on the syllabi provided by the International Requirements Engineering Board (IREB), a non-profit organisation devoted to standardised certification programs for RE. IREB syllabi are developed by RE experts and are, thus, based on the assumption that they address topics of practical relevance. However, little is known about to what extent practitioners actually perceive those contents as useful. We have started a study to investigate the relevance of the EUs included in the IREB Foundation Level certification programme. In a first phase reported in this paper, we have surveyed practitioners mainly from DACH countries (Germany, Austria and Switzerland) participating in the IREB certification. Later phases will widen the scope both by including other countries and by not requiring IREB-certified participants. The results shall foster a critical reflection on the practical relevance of EUs built upon the de-facto standard syllabus of IREB.

SESep 3, 2018
Adaptive Monitoring: A Systematic Mapping

Edith Zavala, Xavier Franch, Jordi Marco

Context: Adaptive monitoring is a method used in a variety of domains for responding to changing conditions. It has been applied in different ways, from monitoring systems' customization to re-composition, in different application domains. However, to the best of our knowledge, there are no studies analyzing how adaptive monitoring differs or resembles among the existing approaches. Method: We have conducted a systematic mapping study of adaptive monitoring approaches following recommended practices. We have applied automatic search and snowballing sampling on different sources and used rigorous selection criteria to retrieve the final set of papers. Moreover, we have used an existing qualitative analysis method for extracting relevant data from studies. Finally, we have applied data mining techniques for identifying patterns in the solutions. Conclusions: This cross-domain overview of the current state of the art on adaptive monitoring may be a solid and comprehensive baseline for researchers and practitioners in the field. Especially, it may help in identifying opportunities of research, for instance, the need of proposing generic and flexible software engineering solutions for supporting adaptive monitoring in a variety of systems.

SEAug 16, 2018
Towards Automated Data Integration in Software Analytics

Silverio Martínez-Fernández, Petar Jovanovic, Xavier Franch et al.

Software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. In order to support "real-time enterprise" for software organizations and provide information transparency for diverse stakeholders, we integrate heterogeneous data sources about software analytics, such as static code analysis, testing results, issue tracking systems, network monitoring systems, etc. To deal with the heterogeneity of the underlying data sources, we follow an ontology-based data integration approach in this paper and define an ontology that captures the semantics of relevant data for software analytics. Furthermore, we focus on the integration of such data sources by proposing two approaches: a static and a dynamic one. We first discuss the current static approach with a predefined set of analytic views representing software quality factors and further envision how this process could be automated in order to dynamically build custom user analysis using a semi-automatic platform for managing the lifecycle of analytics infrastructures.

SEAug 7, 2018
Needs and Challenges for a Platform to Support Large-scale Requirements Engineering. A Multiple Case Study

Davide Fucci, Cristina Palomares, Dolors Costal et al.

Background: Requirement engineering is often considered a critical activity in system development projects. The increasing complexity of software, as well as number and heterogeneity of stakeholders, motivate the development of methods and tools for improving large-scale requirement engineering. Aims: The empirical study presented in this paper aims to identify and understand the characteristics and challenges of a platform, as desired by experts, to support requirement engineering for individual stakeholders, based on the current pain-points of their organizations when dealing with a large number requirements. Method: We conducted a multiple case study with three companies in different domains. We collected data through ten semi-structured interviews with experts from these companies. Results: The main pain-point for stakeholders is handling the vast amount of data from different sources. The foreseen platform should leverage such data to manage changes in requirements according to customers' and users' preferences. It should also offer stakeholders an estimation of how long a requirements engineering task will take to complete, along with an easier requirements dependency identification and requirements reuse strategy. Conclusions: The findings provide empirical evidence about how practitioners wish to improve their requirement engineering processes and tools. The insights are a starting point for in-depth investigations into the problems and solutions presented. Practitioners can use the results to improve existing or design new practices and tools.

SEApr 10, 2018
Protocol and Tools for Conducting Agile Software Engineering Research in an Industrial-Academic Setting: A Preliminary Study

Katarzyna Biesialska, Xavier Franch, Victor Muntés-Mulero

Conducting empirical research in software engineering industry is a process, and as such, it should be generalizable. The aim of this paper is to discuss how academic researchers may address some of the challenges they encounter during conducting empirical research in the software industry by means of a systematic and structured approach. The protocol developed in this paper should serve as a practical guide for researchers and help them with conducting empirical research in this complex environment.

SEMar 5, 2018
SACRE: Supporting contextual requirements' adaptation in modern self-adaptive systems in the presence of uncertainty at runtime

Edith Zavala, Xavier Franch, Jordi Marco et al.

Runtime uncertainty such as unpredictable resource unavailability, changing environmental conditions and user needs, as well as system intrusions or faults represents one of the main current challenges of self-adaptive systems. Moreover, today's systems are increasingly more complex, distributed, decentralized, etc. and therefore have to reason about and cope with more and more unpredictable events. Approaches to deal with such changing requirements in complex today's systems are still missing. This work presents SACRE (Smart Adaptation through Contextual REquirements), our approach leveraging an adaptation feedback loop to detect self-adaptive systems' contextual requirements affected by uncertainty and to integrate machine learning techniques to determine the best operationalization of context based on sensed data at runtime. SACRE is a step forward of our former approach ACon which focus had been on adapting the context in contextual requirements, as well as their basic implementation. SACRE primarily focuses on architectural decisions, addressing self-adaptive systems' engineering challenges. Furthering the work on ACon, in this paper, we perform an evaluation of the entire approach in different uncertainty scenarios in real-time in the extremely demanding domain of smart vehicles. The real-time evaluation is conducted in a simulated environment in which the smart vehicle is implemented through software components. The evaluation results provide empirical evidence about the applicability of SACRE in real and complex software system domains.

SENov 24, 2017
Non-functional Requirements Documentation in Agile Software Development: Challenges and Solution Proposal

Woubshet Behutiye, Pertti Karhapää, Dolors Costal et al.

Non-functional requirements (NFRs) are determinant for the success of software projects. However,they are characterized as hard to define, and in agile software development(ASD), are often given less priority and usually not documented. In this paper, we present the findings of the documentation practices and challenges of NFRs in companies utilizing ASD and propose guidelines for enhancing NFRs documentation in ASD. We interviewed practitioners from four companies and identified that epics, features, user stories, acceptance criteria,Definition of Done(DoD), product and sprint backlogs are used for documenting NFRS. Please refer to the manuscript for the full abstract.

SEMay 25, 2016
iStar 2.0 Language Guide

Fabiano Dalpiaz, Xavier Franch, Jennifer Horkoff

The i* modeling language was introduced to fill the gap in the spectrum of conceptual modeling languages, focusing on the intentional (why?), social (who?), and strategic (how? how else?) dimensions. i* has been applied in many areas, e.g., healthcare, security analysis, eCommerce. Although i* has seen much academic application, the diversity of extensions and variations can make it difficult for novices to learn and use it in a consistent way. This document introduces the iStar 2.0 core language, evolving the basic concepts of i* into a consistent and clear set of core concepts, upon which to build future work and to base goal-oriented teaching materials. This document was built from a set of discussions and input from various members of the i* community. It is our intention to revisit, update and expand the document after collecting examples and concrete experiences with iStar 2.0.

SEJun 22, 2012
Linking Quality Attributes and Constraints with Architectural Decisions

David Ameller, Xavier Franch

Quality attributes and constraints are among the main drivers of architectural decision making. The quality attributes are improved or damaged by the architectural decisions, while restrictions directly include or exclude parts of the architecture (for example, the logical components or technologies). We can determine the impact of a decision of architecture in software quality, or which parts of the architecture are affected by a constraint, but the difficult problem is whether we are respecting the quality requirements (requirements on quality attributes) and constraints with all the architectural decisions made. Currently, the common practice is that architects use their own experience to design architectures that meet the quality requirements and restrictions, but at the end, especially for the crucial decisions, the architect has to deal with complex trade-offs between quality attributes and juggle possible incompatibilities raised by the constraints. In this paper we present Quark, a computer-aided method to support architects in software architecture decision making.