Alexandru Iosup

DC
10 papers
253 citations
Novelty 35%
AI Score 56

10 Papers

53.1 DC Apr 13 Code
OpenDT: Exploring Datacenter Performance and Sustainability with a Self-Calibrating Digital Twin

Radu Nicolae, Jules van der Toorn, Stavriana Kraniti et al.

Datacenters are the backbone of our digital society, but raise numerous operational challenges. We envision digital twins becoming primary instruments in datacenter operations, continuously and autonomously helping with major operational decisions and with adapting ICT infrastructure, live, with a human-in-the-loop. Although fields such as aviation and autonomous driving successfully employ digital twins, an open-source digital twin for datacenters has not been demonstrated to the community. Addressing this challenge, we design, implement, and experiment using OpenDT, an Open-source, Digital Twin for monitoring and operating datacenters through a continuous integration cycle that includes: (1) live and continuous telemetry data; (2) discrete-event simulation using live telemetry from the physical ICT, with self-calibration; and (3) SLO-aware and human-approved feedback to physical ICT. Through trace-driven experiments with a prototype mainly covering stages 1 and 2 of the cycle, we show that (i) OpenDT can be used to reproduce peer-reviewed experiments and extend the analysis with performance and energy-efficiency results; (ii) OpenDT's online re-calibration can increase digital-twinning accuracy, quantified to a MAPE of 4.39% vs. 7.86% in peer-reviewed work. OpenDT adheres to FAIR/FOSS principles and is available at: https://github.com/atlarge-research/opendt/tree/hcp.
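
For readers unfamiliar with the metric, the MAPE (mean absolute percentage error) quoted above can be sketched as follows. This is a minimal illustration of the standard definition, not code from OpenDT; the function name `mape` and the sample readings are hypothetical and are not data from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Standard definition: 100/n * sum(|(a_i - p_i) / a_i|).
    Assumes no actual value is zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical power readings in watts (illustrative only).
measured = [200.0, 250.0, 400.0]
simulated = [190.0, 260.0, 380.0]
error = mape(measured, simulated)
print(f"MAPE: {error:.2f}%")
```

A lower value means the simulated series tracks the measured one more closely, which is the sense in which 4.39% improves on 7.86%.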

46.8 PF Mar 17 Code
Leveraging LLMs for Structured Information Extraction and Analysis from Cloud Incident Reports (Work In Progress Paper)

Xiaoyu Chu, Shashikant Ilager, Yizhen Zang et al.

Incident management is essential to maintain the reliability and availability of cloud computing services. Cloud vendors typically disclose incident reports to the public, summarizing the failures and recovery process to help minimize their impact. However, such reports are often lengthy and unstructured, making them difficult to understand, analyze, and use for long-term dependability improvements. The emergence of LLMs offers new opportunities to address this challenge, but how to achieve this is currently understudied. In this paper, we explore the use of cutting-edge LLMs to extract key information from unstructured cloud incident reports. First, we collect more than 3,000 incident reports from 3 leading cloud service providers (AWS, Azure, and GCP), and manually annotate these collected samples. Then, we design and compare 6 prompt strategies to extract and classify different types of information. We consider 6 LLMs, including 3 lightweight and 3 state-of-the-art (SotA) models, and evaluate model accuracy, latency, and token cost across datasets, models, prompts, and extracted fields. Our study has uncovered the following key findings: (1) LLMs achieve high metadata extraction accuracy, 75% to 95% depending on the dataset. (2) Few-shot prompting generally improves accuracy for metadata fields except for classification, and has better (lower) latency due to fewer output tokens, but requires 1.5 to 2× more input tokens. (3) Lightweight models (e.g., Gemini 2.0, GPT 3.5) offer favorable trade-offs in accuracy, cost, and latency; SotA models yield higher accuracy at significantly greater cost and latency. Our study provides tools, methodologies, and insights for leveraging LLMs to accurately and efficiently extract incident-report information. The FAIR data and code are publicly available at https://github.com/atlarge-research/llm-cloud-incident-extraction.

6.2 DC Mar 19
Literature Study on Operational Data Analytics Frameworks in Large-scale Computing Infrastructures

Shekhar Suman, Xiaoyu Chu, Alexandru Iosup

As of 2025, zettabytes of data are generated every year. Modern large-scale computing infrastructures such as High-Performance Computing (HPC) systems continue to grow in size and complexity, raising concerns about their manageability and sustainability. For this reason, such systems are equipped with fine-grained monitoring and Operational Data Analytics (ODA) capabilities to optimise their efficiency. In this literature study, we list the fundamental pillars of large-scale computing infrastructures that enable their ODA capabilities, and conduct a study of the popular ODA frameworks operating in various such environments (predominantly HPC). Based on that, we propose a more holistic ODA framework that matches the layers of the large-scale graph-processing distributed ecosystem proposed by Sherif Sakr et al., and that extends the ODA functionalities of an existing ODA framework proposed by Netti et al. We compare our holistic ODA framework with some of the state-of-the-art frameworks covered in this study to highlight its novelty, which we hope will draw more attention to extensive research in this field. To raise awareness, we highlight the significant operational efficiencies observed after deploying state-of-the-art ODA frameworks, to convey their benefits to readers, and lastly discuss ongoing research trends in this field.

46.5 DC May 24
Kavier: Exploring Performance, Sustainability, and Efficiency of LLM Ecosystems under Inference through Cache-Aware Discrete-Event Simulation

Radu Nicolae, Alexandru Iosup, Animesh Trivedi et al.

Large Language Models (LLMs) are widely used by our increasingly digitalized society, but raise sustainability, performance, and financial concerns, especially as inference workloads grow. To improve the design and operation of LLM ecosystems, we envision simulators and simulation-based digital twins becoming primary decision-making tools. LLM ecosystems leverage many heterogeneous components, making simulation a non-trivial, yet critical operation. The simulation challenge is exacerbated by the absence of a comprehensive reference architecture of LLM ecosystems; the lack of such a conceptual model can be costly and could misguide designers and engineers. Without a reference architecture, even the most experienced stakeholders could tinker in researching, engineering, or maintaining LLM ecosystems. In this work, we bring a three-fold contribution to the scientific community. First, we synthesize, propose, and validate a reference architecture (RA) of LLM ecosystems under inference. Second, adhering to the reference architecture, we design Kavier, the first simulation instrument able to predict the performance, sustainability, and efficiency of LLM ecosystems under inference, through discrete-event and cache-aware simulation, focusing on Key-Value-(KV-)Caching and prompt prefix caching policies. Third, through experiments with a Kavier prototype and real-world traces, (i) we measure the accuracy of Kavier and its performance in massive-scale simulations, (ii) we compare the performance of different KV-Caching policies, and (iii) we analyze the performance, sustainability, and efficiency of LLM ecosystems under various prefix caching policies. Overall, we show that Kavier enables operators, researchers, and engineers to predict LLM ecosystems in a time-, performance-, and cost-efficient way.

5.5 DC Mar 31 Code
M3SA: Exploring Datacenter Performance and Climate-Impact with Multi- and Meta-Model Simulation and Analysis

Radu Nicolae, Dante Niewenhuis, Sacheendra Talluri et al.

Datacenters are vital to our digital society, but consume a considerable fraction of global electricity and demand is projected to increase. To improve their sustainability and performance, we envision that simulators will become primary decision-making tools. However, and unlike other fields focusing on key societal infrastructure such as waterworks and mass transit, datacenter simulators do not yet combine multiple independent models into their operation and thus suffer from issues associated with singular models, such as specialization, and lack of adaptability to operational phenomena. To address this challenge, we propose M3SA, a datacenter simulation and analysis framework that uses discrete-event simulation to predict, for each model, the impact on climate and performance under various realistic datacenter conditions, and then combines these predictions. We design an architecture for simulating multiple concurrent models (Multi-Model), a technique for integrating the results of multiple models into a Meta-Model, and a procedure for quantifying Meta-Model accuracy. Through experiments with an M3SA prototype, we show that (i) M3SA can reproduce and enhance peer-reviewed experiments; (ii) M3SA can predict operational phenomena (e.g., failures) of datacenters, running fundamentally different workload traces; (iii) M3SA enables various types of what-if and how-to analysis, such as how to configure CO2-aware migration over yearly energy-production patterns. M3SA has been integrated into the open-source simulator OpenDC and is available at: https://github.com/atlarge-research/opendc-m3sa.
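
The idea of combining several models' predictions into a Meta-Model can be sketched roughly as below. This is an illustrative assumption, not M3SA's actual implementation: the per-timestep median (or mean) aggregation, the function name `meta_model`, and the sample series are all hypothetical.

```python
from statistics import mean, median


def meta_model(predictions, aggregate=median):
    """Combine per-model time series into one series.

    `predictions` maps a model name to its predicted values, one per
    timestep. Each timestep is aggregated across all models using
    `aggregate` (median by default; an assumption for this sketch).
    """
    lengths = {len(series) for series in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one model and equal-length series")
    return [aggregate(step) for step in zip(*predictions.values())]


# Hypothetical power predictions (watts) from three independent models.
predictions = {
    "model_a": [100.0, 120.0, 140.0],
    "model_b": [110.0, 118.0, 150.0],
    "model_c": [90.0, 125.0, 145.0],
}
meta = meta_model(predictions)
print(meta)
```

Passing `aggregate=mean` instead gives an average-based combination; a median is less sensitive to a single outlying model, which is one plausible reason to prefer it.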

15.7 DC Mar 12 Code
OpenDC-STEAM: Realistic Modeling and Systematic Exploration of Composable Techniques for Sustainable Datacenters

Dante Niewenhuis, Sacheendra Talluri, Alexandru Iosup et al.

The need to reduce datacenter carbon footprint is urgent. While many sustainability techniques have been proposed, they are often evaluated in isolation, using limited setups or analytical models that overlook real-world dynamics and interactions between methods. This makes it challenging for researchers and operators to understand the effectiveness and trade-offs of combining such techniques. We design OpenDC-STEAM, an open-source customizable datacenter simulator, to investigate the individual and combined impact of sustainability techniques on datacenter operational and embodied carbon emissions, and their trade-off with performance. Using STEAM, we systematically explore three representative techniques - horizontal scaling, leveraging batteries, and temporal shifting - with diverse representative workloads, datacenter configurations, and carbon-intensity traces. Our analysis highlights that datacenter dynamics can influence their effectiveness and that combining strategies can significantly lower emissions, but introduces complex cost-emissions-performance trade-offs that STEAM can help navigate. STEAM supports the integration of new models and techniques, making it a foundation framework for holistic, quantitative, and reproducible research in sustainable computing. Following open-science principles, STEAM is available as FOSS: https://github.com/atlarge-research/OpenDC-STEAM.

SE Sep 17, 2020 Code
Serverless Applications: Why, When, and How?

Simon Eismann, Joel Scheuner, Erwin van Eyk et al.

Serverless computing shows good promise for efficiency and ease-of-use. Yet, there are only a few, scattered and sometimes conflicting reports on questions such as 'Why do so many companies adopt serverless?', 'When are serverless applications well suited?', and 'How are serverless applications currently implemented?' To address these questions, we analyze 89 serverless applications from open-source projects, industrial sources, academic literature, and scientific computing - the most extensive study to date.

SE Aug 25, 2020
A Review of Serverless Use Cases and their Characteristics

Simon Eismann, Joel Scheuner, Erwin van Eyk et al.

The serverless computing paradigm promises many desirable properties for cloud applications - low-cost, fine-grained deployment, and management-free operation. Consequently, the paradigm has undergone rapid growth: there currently exist tens of serverless platforms and all global cloud providers host serverless operations. To help tune existing platforms, guide the design of new serverless approaches, and overall contribute to understanding this paradigm, in this work we present a long-term, comprehensive effort to identify, collect, and characterize 89 serverless use cases. We survey use cases, sourced from white and grey literature, and from consultations with experts in areas such as scientific computing. We study each use case using 24 characteristics, including general aspects, but also workload, application, and requirements. When the use cases employ workflows, we further analyze their characteristics. Overall, we hope our study will be useful for both academia and industry, and encourage the community to further share and communicate their use cases. This article appears also as a SPEC Technical Report: https://research.spec.org/fileadmin/user_upload/documents/rg_cloud/endorsed_publications/SPEC_RG_2020_Serverless_Usecases.pdf The article may be submitted for peer-reviewed publication.

DC Feb 15, 2018
Massivizing Computer Systems: a Vision to Understand, Design, and Engineer Computer Ecosystems through and beyond Modern Distributed Systems

Alexandru Iosup, Alexandru Uta, Laurens Versluis et al.

Our society is digital: industry, science, governance, and individuals depend, often transparently, on the inter-operation of large numbers of distributed computer systems. Although society takes them almost for granted, these computer ecosystems are not available for all, may not be affordable for long, and raise numerous other research challenges. Inspired by these challenges and by our experience with distributed computer systems, we envision Massivizing Computer Systems, a domain of computer science focusing on understanding, controlling, and successfully evolving such ecosystems. Beyond establishing and growing a body of knowledge about computer ecosystems and their constituent systems, the community in this domain should also aim to educate many about design and engineering for this domain, and all people about its principles. This is a call to the entire community: there is much to discover and achieve.

SE Nov 1, 2016
Self-Awareness of Cloud Applications

Alexandru Iosup, Xiaoyun Zhu, Arif Merchant et al.

Cloud applications today deliver an increasingly large portion of Information and Communication Technology (ICT) services. To address the scale, growth, and reliability of cloud applications, self-aware management and scheduling are becoming commonplace. How are they used in practice? In this chapter, we propose a conceptual framework for analyzing state-of-the-art self-awareness approaches used in the context of cloud applications. We map important applications corresponding to popular and emerging application domains to this conceptual framework, and compare the practical characteristics, benefits, and drawbacks of self-awareness approaches. Last, we propose a roadmap for addressing open challenges in self-aware cloud and datacenter applications.