DC Nov 15, 2022
Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics
Dominik Scheinert, Soeren Becker, Jonathan Bader et al.
Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for use in big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that the resulting benchmark metrics are directly comparable and resources can be ranked. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, which allows resource degradation to be detected. We evaluate our approach both on data gathered from our own experiments and within related work on resource configuration optimization, demonstrating that Perona captures the characteristics of benchmark runs in a compact manner and produces representations that can be used directly.
DC May 3
Learning Process Energy Profiles from Node-Level Power Data
Jonathan Bader, Julius Irion, Jannis Kappel et al.
The growing demand for data center capacity, driven by the growth of high-performance computing, cloud computing, and especially artificial intelligence, has led to a sharp increase in data center energy consumption. To improve energy efficiency, gaining process-level insights into energy consumption is essential. While node-level energy consumption data can be directly measured with hardware such as power meters, existing mechanisms for estimating per-process energy usage, such as Intel RAPL, are limited to specific hardware and provide only coarse-grained, domain-level measurements. Our proposed approach models per-process energy profiles by leveraging fine-grained process-level resource metrics collected via eBPF and perf, which are synchronized with node-level energy measurements obtained from an attached power distribution unit. By statistically learning the relationship between process-level resource usage and node-level energy consumption through a regression-based model, our approach enables more fine-grained per-process energy predictions.
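The abstract does not specify the regression model or its features, so the following is only a minimal sketch of the general idea, under stated assumptions: node power is modeled as an idle baseline plus a linear function of summed per-process CPU and I/O usage, fitted with ordinary least squares. The function names, the two features, and the choice to leave idle power unattributed are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed per-process features for one interval: (cpu_cores_used, io_mb_per_s).
Usage = Tuple[float, float]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_node_power_model(
    intervals: Sequence[Dict[str, Usage]], node_watts: Sequence[float]
) -> List[float]:
    """Fit node_watts ~ idle + w_cpu * sum(cpu) + w_io * sum(io).

    Returns [idle, w_cpu, w_io], obtained from the normal equations.
    """
    rows = [
        [1.0, sum(u[0] for u in procs.values()), sum(u[1] for u in procs.values())]
        for procs in intervals
    ]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, node_watts)) for i in range(k)]
    return solve(xtx, xty)


def attribute_power(weights: Sequence[float], procs: Dict[str, Usage]) -> Dict[str, float]:
    """Estimate each process's dynamic power; idle power is left unattributed."""
    _, w_cpu, w_io = weights
    return {name: w_cpu * cpu + w_io * io for name, (cpu, io) in procs.items()}
```

Because the model is linear in the summed features, the per-process estimates add up to the predicted node power minus the idle term, which is what makes the node-level measurement usable as a training signal.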
DC Apr 20
Optimizing Memory Allocation in Distributed Clusters with Predictive Modeling
Jonathan Bader, Edgar Blumenthal, Marten Eckardt et al.
In modern distributed systems, efficient resource allocation is vital to maintain scalability, reduce operational costs, and ensure fast execution even across heterogeneous workloads. Predictive models for resource usage are essential tools for optimizing allocation and preventing system bottlenecks. A key challenge of predictive memory allocation is its asymmetric costs: underallocation causes failures, while overallocation wastes memory. We propose a regression method based on a LightGBM and XGBoost ensemble trained to predict high conditional quantiles. To further account for the high cost of underallocation, we add a multiplicative safety factor. With our method, we are able to reduce the share of underallocated jobs from 4.17% to 2.89% and the average overallocation from 148% to 44.51% on a real-world dataset of build jobs provided by SAP. We further explore the Pareto frontier between optimizing for underallocation and for overallocation.
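The asymmetric-cost idea can be illustrated with a small sketch. It is not the LightGBM and XGBoost ensemble from the abstract: an unconditional empirical quantile stands in for the learned conditional quantile, and the metric definitions (underallocation as the share of jobs whose usage exceeds the allocation, overallocation as relative excess averaged over the remaining jobs) are assumptions, as are all function names.

```python
import math
from typing import Sequence, Tuple


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile; a stand-in for a learned conditional quantile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Quantile (pinball) loss: for q > 0.5, underprediction costs more."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def allocate(predicted_quantile: float, safety_factor: float) -> float:
    """Apply a multiplicative safety factor on top of the quantile prediction."""
    return predicted_quantile * safety_factor


def evaluate(actual: Sequence[float], allocated: Sequence[float]) -> Tuple[float, float]:
    """Return (underallocation rate, mean relative overallocation).

    Assumed definitions: a job is underallocated if its usage exceeds the
    allocation; overallocation is (allocated - used) / used, averaged over
    the jobs that were not underallocated.
    """
    under = sum(1 for used, alloc in zip(actual, allocated) if used > alloc)
    over = [(alloc - used) / used for used, alloc in zip(actual, allocated) if used <= alloc]
    mean_over = sum(over) / len(over) if over else 0.0
    return under / len(actual), mean_over
```

Raising either the target quantile or the safety factor lowers the underallocation rate at the price of more wasted memory, which is the trade-off that a Pareto analysis of the two objectives explores.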
DC Apr 14
Intelligent resource prediction for SAP HANA continuous integration build workloads
Torsten Mandel, Jonathan Bader, Hanyoung Yoo et al.
Large enterprises often operate extensive Continuous Integration (CI) pipelines on large, heterogeneous compute clusters, where conservative, statically defined resource requirements are used to ensure build reliability. This practice leads to substantial system memory over-allocation, reduced cluster utilization, and increased operational costs. In this paper, we motivate the need for intelligent resource prediction by analyzing over 300,000 historical build executions from a production CI environment with more than one thousand compute nodes. Our analysis shows that, on average, more than 60% of allocated system memory remains unused. We then compare multiple machine learning approaches for predicting build task memory usage, including classification-based methods and regression-based quantile prediction. Our final solution employs a LightGBM-XGBoost quantile regression ensemble optimized to minimize under-allocation while reducing over-provisioning. We integrate this solution into the production CI pipeline via a microservice-based orchestration layer, achieving average memory savings of approximately 36 GB per build and reducing under-allocation rates to below 0.3% without negatively impacting build execution times.
DC Nov 16, 2021
On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds
Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader et al.
With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users increasingly execute their workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and costly to collect. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.