Ivano Malavolta

SE
h-index: 8
Papers: 11
Citations: 251
Novelty: 18%
AI Score: 40

11 Papers

SE | Jul 8, 2024 | Code
Ten Years of Teaching Empirical Software Engineering in the context of Energy-efficient Software

Ivano Malavolta, Vincenzo Stoico, Patricia Lago

In this chapter we share our experience in running ten editions of the Green Lab course at the Vrije Universiteit Amsterdam, the Netherlands. The course is given in the Software Engineering and Green IT track of the Computer Science Master program of the VU. The course takes place every year over a 2-month period and teaches Computer Science students the fundamentals of Empirical Software Engineering in the context of energy-efficient software. The peculiarity of the course is its research orientation: at the beginning of the course the instructor presents a catalog of scientifically relevant goals, and each team of students signs up for one of them and works together for 2 months on their own experiment for achieving the goal. Each team goes over the classic steps of an empirical study, starting from a precise formulation of the goal and research questions to context definition, selection of experimental subjects and objects, definition of experimental variables, experiment execution, data analysis, and reporting. Over the years, the course became well-known within the Software Engineering community since it led to several scientific studies that have been published at various scientific conferences and journals. Also, students execute their experiments using open-source tools, which are developed and maintained by researchers and other students within the program, thus creating a virtuous community of learners where students exchange ideas, help each other, and learn how to collaboratively contribute to open-source projects in a safe environment.

SE | Mar 27
Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Eziyo Ehsani, Luca Giamattei, Ivano Malavolta et al.

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality. Unlike theoretical studies, we captured granular power metrics across eight models ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. We harness this pipeline to conduct an empirical case study on a flagship Android device, the Samsung Galaxy S25 Ultra, establishing foundational hypotheses regarding the trade-offs between generation quality, performance, and resource consumption. Our investigation uncovered a counter-intuitive quantization-energy paradox. While modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

SE | Apr 23
Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?

Laura Duits, Bouazza El Moutaouakil, Ivano Malavolta

Context. The most used development framework for robotics software is ROS2. ROS2 architectures are highly complex, with thousands of components communicating in a decentralized fashion. Goal. We aim to evaluate how LLMs can assist in the comprehension of factual information about the architecture of ROS2 systems. Method. We conduct a controlled experiment where we administer 1,230 prompts to 9 LLMs containing architecturally-relevant questions about 3 ROS2 systems with incremental size. We provide a generic algorithm that systematically generates architecturally-relevant questions for a ROS2 system. Then, we (i) assess the accuracy of the answers of the LLMs against a ground truth established via running and monitoring the 3 ROS2 systems and (ii) qualitatively analyse the explanations provided by the LLMs. Results. Almost all questions are answered correctly across all LLMs (mean=98.22%). gemini-2.5-pro performs best (100% accuracy across all prompts and systems), followed by o3 (99.77%), and gemini-2.5-flash (99.72%); the worst-performing LLM is gpt-4.1 (95%). Only 300/1,230 prompts are incorrectly answered, of which 249 are about the most complex system. The coherence scores in the LLMs' explanations range from 0.394 for "service references" to 0.762 for "communication path". The mean perplexity varies significantly across models, with chatgpt-4o achieving the lowest score (19.6) and o4-mini the highest (103.6). Conclusions. There is great potential in the usage of LLMs to aid ROS2 developers in comprehending non-trivial aspects of the software architecture of their systems. Nevertheless, developers should be aware of the intrinsic limitations and different performances of the LLMs and take those into account when using them.

SE | Sep 12, 2025
Generating Energy-Efficient Code via Large-Language Models -- Where are we now?

Radu Apsan, Vincenzo Stoico, Michel Albonico et al.

Context. The rise of Large Language Models (LLMs) has led to their widespread adoption in development pipelines. Goal. We empirically assess the energy efficiency of Python code generated by LLMs against human-written code and code developed by a Green software expert. Method. We test 363 solutions to 9 coding problems from the EvoEval benchmark using 6 widespread LLMs with 4 prompting techniques, and compare them to human-developed solutions. Energy consumption is measured on three different hardware platforms: a server, a PC, and a Raspberry Pi for a total of ~881h (36.7 days). Results. Human solutions are 16% more energy-efficient on the server and 3% on the Raspberry Pi, while LLMs outperform human developers by 25% on the PC. Prompting does not consistently lead to energy savings, and the most energy-efficient prompts vary by hardware platform. The code developed by a Green software expert is consistently more energy-efficient by at least 17% to 30% against all LLMs on all hardware platforms. Conclusions. Even though LLMs exhibit relatively good code generation capabilities, no LLM-generated code was more energy-efficient than that of an experienced Green software developer, suggesting that as of today there is still a great need for human expertise in developing energy-efficient Python code.

SE | May 6, 2024
A Controlled Experiment on the Energy Efficiency of the Source Code Generated by Code Llama

Vlad-Andrei Cursaru, Laura Duits, Joel Milligan et al.

Context. Nowadays, 83% of software developers use Large Language Models (LLMs) to generate code. LLMs recently became essential to increase the productivity of software developers and decrease the time and cost of software development. Developers ranging from novices to experts use LLM tools not only to detect and patch bugs, but also to integrate generated code into their software. However, as of today there is no objective assessment of the energy efficiency of the source code generated by LLM tools. Released in August 2023, Code Llama is one of the most recent LLM tools. Goal. In this paper, we present an empirical study that assesses the energy efficiency of Code Llama with respect to human-written source code. Method. We design an experiment involving three human-written benchmarks implemented in C++, JavaScript, and Python. We ask Code Llama to generate the code of the benchmarks using different prompts and temperatures. We then execute both implementations and profile their energy efficiency. Results. Our study shows that the energy efficiency of code generated by Code Llama is heavily dependent on the chosen programming language and the specific code problem at hand. Also, human implementations tend to be more energy efficient overall, although generated JavaScript code outperforms its human counterpart. Moreover, explicitly asking Code Llama to generate energy-efficient code results in equal or worse energy efficiency, and using different temperatures does not seem to affect the energy efficiency of the generated code. Conclusions. According to our results, code generated using Code Llama does not guarantee energy efficiency, even when prompted to do so. Therefore, software developers should evaluate the energy efficiency of generated code before integrating it into the software system under development.

RO | Mar 25, 2021
Mining Energy-Related Practices in Robotics Software

Michel Albonico, Ivano Malavolta, Gustavo Pinto et al.

Robots are becoming more and more commonplace in many industry settings. This successful adoption can be partly attributed to (1) their increasingly affordable cost and (2) the possibility of developing intelligent, software-driven robots. Unfortunately, robotics software consumes significant amounts of energy. Moreover, robots are often battery-driven, meaning that even a small energy improvement can help reduce their energy footprint and increase their autonomy and user experience. In this paper, we study the Robot Operating System (ROS) ecosystem, the de-facto standard for developing and prototyping robotics software. We analyze 527 energy-related data points (including commits, pull-requests, and issues on ROS-related repositories, ROS-related questions on StackOverflow, ROS Discourse, ROS Answers, and the official ROS Wiki). Our results include a quantification of the interest of roboticists in software energy efficiency, 10 recurrent causes and 14 solutions of energy-related issues, and their implied trade-offs with respect to other quality attributes. Those contributions support roboticists and researchers towards having energy-efficient software in future robotics projects.

SE | Nov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Steffen Herbold, Alexander Trautsch, Benjamin Ledel et al.

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
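The consensus rule stated in the abstract above (four labels per line, consensus when at least three agree) can be sketched as follows. This is only an illustrative sketch, not the authors' tooling: the function name `consensus_label` and the label strings are hypothetical and chosen for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label shared by at least `min_agreement` participants, else None."""
    if not labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical label names, used only for illustration.
agreed = consensus_label(["bugfix", "bugfix", "bugfix", "test"])
disputed = consensus_label(["bugfix", "bugfix", "test", "refactoring"])
print(agreed)    # three of four participants agree
print(disputed)  # no label reaches three votes
```

With four labels per line and a threshold of three, at most one label can reach consensus, so ties do not need separate handling.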

SE | Sep 26, 2018
Datasets of Android Applications: a Literature Review

Franz-Xaver Geiger, Ivano Malavolta

Mobile phones and tablets have become the most widely used computing devices, with a large predominance of the Android platform. As a natural evolution, the development of Android applications has surged and has become a major field of study, with research efforts ranging from energy efficiency, to code smells, performance, maintainability, security, etc. These kinds of challenges ask for dedicated solutions, tools, and datasets. This survey identifies and reviews 31 existing datasets of Android applications and classifies each of them according to key features, such as the total number of apps it contains, whether the commit history of the apps is available, whether it focusses on the source code or on the executable binaries of the apps, the sources used for building the dataset, etc. This study can benefit both experienced and novice researchers interested in doing research on Android apps, who can use the results of our study as a map for identifying the most suitable datasets for their research objectives.

SE | Nov 8, 2016
Protocol for a Systematic Mapping Study on Collaborative Model-Driven Software Engineering

Mirco Franzago, Davide Di Ruscio, Ivano Malavolta et al.

Nowadays, collaborative modeling performed by multiple stakeholders is gaining growing interest in both academia and practice. However, it poses a set of research challenges, such as the management of large and complex models, support for multi-user modeling environments, and synchronization mechanisms like model migration and merging, conflict management, model versioning and rollback support. A body of knowledge in the scientific literature about collaborative model-driven software engineering (MDSE) exists. Still, those studies are scattered across different independent research areas, such as software engineering, model-driven engineering languages and systems, model integrated computing, etc., and a study classifying and comparing the various approaches and methods for collaborative MDSE is still missing. Under this perspective, a systematic mapping study (SMS) can help researchers and practitioners in (i) having a complete, comprehensive and valid picture of the state of the art about collaborative MDSE, and (ii) identifying potential gaps in current research and future research directions.

SY | May 31, 2016
Cyber-Physical Systems Security: a Systematic Mapping Study

Yuriy Zacchia Lun, Alessandro D'Innocenzo, Ivano Malavolta et al.

Cyber-physical systems are integrations of computation, networking, and physical processes. Due to the tight cyber-physical coupling and to the potentially disrupting consequences of failures, security here is one of the primary concerns. Our systematic mapping study sheds some light on how security is actually addressed when dealing with cyber-physical systems. The provided systematic map of 118 selected studies is based on, for instance, application fields, various system components, related algorithms and models, attack characteristics and defense strategies. It presents a powerful comparison framework for existing and future research on this hot topic, important for both industry and academia.

SE | Feb 13, 2015
Stakeholders, Viewpoints and Languages of a Modelling Framework for the Design and Development of Data-Intensive Mobile Apps

Mirco Franzago, Ivano Malavolta, Henry Muccini

Today millions of mobile apps are downloaded and used all over the world. Guidelines and best practices on how to design and develop mobile apps are being periodically released, mainly by mobile platform vendors and researchers. They cover different concerns, and refer to different technical and non-technical stakeholders. Still, mobile applications are developed with ad-hoc development processes, and on-paper best practices. In this paper we discuss a multi-view modelling framework supporting the collaborative design and development of mobile apps. The proposed framework embraces the Model-Driven Engineering methodology. This paper provides an overall view of the modelling framework in terms of its main stakeholders, viewpoints, and modelling languages.