Joel Witzke

61.2DCMay 3

Learning Process Energy Profiles from Node-Level Power Data

Jonathan Bader, Julius Irion, Jannis Kappel et al.

The growing demand for data center capacity, driven by the growth of high-performance computing, cloud computing, and especially artificial intelligence, has led to a sharp increase in data center energy consumption. To improve energy efficiency, gaining process-level insights into energy consumption is essential. While node-level energy consumption data can be directly measured with hardware such as power meters, existing mechanisms for estimating per-process energy usage, such as Intel RAPL, are limited to specific hardware and provide only coarse-grained, domain-level measurements. Our proposed approach models per-process energy profiles by leveraging fine-grained process-level resource metrics collected via eBPF and perf, which are synchronized with node-level energy measurements obtained from an attached power distribution unit. By statistically learning the relationship between process-level resource usage and node-level energy consumption through a regression-based model, our approach enables more fine-grained per-process energy predictions.

15.2DCApr 20

Optimizing Memory Allocation in Distributed Clusters with Predictive Modeling

Jonathan Bader, Edgar Blumenthal, Marten Eckardt et al.

In modern distributed systems, efficient resource allocation is a vital aspect to maintain scalability, reduce operational costs, and ensure fast execution even across heterogeneous workloads. Predictive models for resource usage are essential tools for optimizing allocation and preventing system bottlenecks. Predictive memory allocation has asymmetric costs as a key challenge: underallocation causes failures while overallocation wastes memory. We propose a regression method based on a LightGBM and XGBoost ensemble trained to predict high conditional quantiles. To further account for the high cost of underallocations we add a multiplicative safety factor. With our method we are able to reduce the number of under-allocated jobs from 4.17% to 2.89% and average overallocation from 148% to 44.51% on a real-world dataset of build jobs provided by SAP. We further explore the pareto frontier between optimization for underallocation and for overallocation.

Joel Witzke

2 Papers