46.8PFMar 17Code
Leveraging LLMs for Structured Information Extraction and Analysis from Cloud Incident Reports (Work In Progress Paper)Xiaoyu Chu, Shashikant Ilager, Yizhen Zang et al.
Incident management is essential to maintain the reliability and availability of cloud computing services. Cloud vendors typically disclose incident reports to the public, summarizing the failures and recovery process to help minimize their impact. However, such reports are often lengthy and unstructured, making them difficult to understand, analyze, and use for long-term dependability improvements. The emergence of LLMs offers new opportunities to address this challenge, but how to achieve this is currently understudied. In this paper, we explore the use of cutting-edge LLMs to extract key information from unstructured cloud incident reports. First, we collect more than 3,000 incident reports from 3 leading cloud service providers (AWS, AZURE, and GCP), and manually annotate these collected samples. Then, we design and compare 6 prompt strategies to extract and classify different types of information. We consider 6~LLM models, including 3 lightweight and 3 state-of-the-art (SotA), and evaluate model accuracy, latency, and token cost across datasets, models, prompts, and extracted fields. Our study has uncovered the following key findings: (1) LLMs achieve high metadata extraction accuracy, $75\%\text{--}95\%$ depending on the dataset. (2) Few-shot prompting generally improves accuracy for meta-data fields except for classification, and has better (lower) latency due to shorter output-tokens but requires $1.5\text{--}2\times$ more input-tokens. (3) Lightweight models (e.g., Gemini~2.0, GPT~3.5) offer favorable trade-offs in accuracy, cost, and latency; SotA models yield higher accuracy at significantly greater cost and latency. Our study provides tools, methodologies, and insights for leveraging LLMs to accurately and efficiently extract incident-report information. The FAIR data and code are publicly available at https://github.com/atlarge-research/llm-cloud-incident-extraction.
5.6DCMar 31Code
M3SA: Exploring Datacenter Performance and Climate-Impact with Multi- and Meta-Model Simulation and AnalysisRadu Nicolae, Dante Niewenhuis, Sacheendra Talluri et al.
Datacenters are vital to our digital society, but consume a considerable fraction of global electricity and demand is projected to increase. To improve their sustainability and performance, we envision that simulators will become primary decision-making tools. However, and unlike other fields focusing on key societal infrastructure such as waterworks and mass transit, datacenter simulators do not yet combine multiple independent models into their operation and thus suffer from issues associated with singular models, such as specialization, and lack of adaptability to operational phenomena. To address this challenge, we propose M3SA, a datacenter simulation and analysis framework that uses discrete-event simulation to predict, for each model, the impact on climate and performance under various realistic datacenter conditions, and then combines these predictions. We design an architecture for simulating multiple concurrent models (Multi-Model), a technique for integrating the results of multiple models into a Meta-Model, and a procedure for quantifying Meta-Model accuracy. Through experiments with an M3SA prototype, we show that (i) M3SA can reproduce and enhance peer-reviewed experiments; (ii) M3SA can predict operational phenomena (e.g., failures) of datacenters, running fundamentally different workload traces; (iii) M3SA enables various types of what-if and how-to analysis, such as how to configure CO2-aware migration over yearly energy-production patterns. M3SA has been integrated into the open-source simulator OpenDC and is available at: https://github.com/atlarge-research/opendc-m3sa.
15.8DCMar 12Code
OpenDC-STEAM: Realistic Modeling and Systematic Exploration of Composable Techniques for Sustainable DatacentersDante Niewenhuis, Sacheendra Talluri, Alexandru Iosup et al.
The need to reduce datacenter carbon footprint is urgent. While many sustainability techniques have been proposed, they are often evaluated in isolation, using limited setups or analytical models that overlook real-world dynamics and interactions between methods. This makes it challenging for researchers and operators to understand the effectiveness and trade-offs of combining such techniques. We design OpenDC-STEAM, an open-source customizable datacenter simulator, to investigate the individual and combined impact of sustainability techniques on datacenter operational and embodied carbon emissions, and their trade-off with performance. Using STEAM, we systematically explore three representative techniques - horizontal scaling, leveraging batteries, and temporal shifting - with diverse representative workloads, datacenter configurations, and carbon-intensity traces. Our analysis highlights that datacenter dynamics can influence their effectiveness and that combining strategies can significantly lower emissions, but introduces complex cost-emissions-performance trade-offs that STEAM can help navigate. STEAM supports the integration of new models and techniques, making it a foundation framework for holistic, quantitative, and reproducible research in sustainable computing. Following open-science principles, STEAM is available as FOSS: https://github.com/atlarge-research/OpenDC-STEAM.
DCFeb 15, 2018
Massivizing Computer Systems: a Vision to Understand, Design, and Engineer Computer Ecosystems through and beyond Modern Distributed SystemsAlexandru Iosup, Alexandru Uta, Laurens Versluis et al.
Our society is digital: industry, science, governance, and individuals depend, often transparently, on the inter-operation of large numbers of distributed computer systems. Although the society takes them almost for granted, these computer ecosystems are not available for all, may not be affordable for long, and raise numerous other research challenges. Inspired by these challenges and by our experience with distributed computer systems, we envision Massivizing Computer Systems, a domain of computer science focusing on understanding, controlling, and evolving successfully such ecosystems. Beyond establishing and growing a body of knowledge about computer ecosystems and their constituent systems, the community in this domain should also aim to educate many about design and engineering for this domain, and all people about its principles. This is a call to the entire community: there is much to discover and achieve.