On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds
This work addresses the problem of resource configuration for cloud users, offering an incremental improvement over existing methods by leveraging shared data to reduce costly profiling.
The paper tackles the challenge of optimizing batch processing workloads in public clouds by proposing a collaborative approach for sharing anonymized execution traces to mine patterns and cluster workloads, demonstrating the predictive value of these clusters on a public dataset.
With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users increasingly execute their workloads in the cloud. As configuring workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, the performance data needed to train such methods is often lacking and costly to collect. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.
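To make the idea of exploiting clusters of historical workloads more concrete, the following is a minimal sketch and not the paper's implementation. It assumes each shared trace is an execution graph over anonymized integer task IDs with a recorded runtime. It groups traces by a coarse structural signature (task count, edge count, graph depth), which is a simple stand-in for the clustering method that the abstract does not specify, and predicts the runtime of a new workload as the median of its group. All names (`graph_signature`, `cluster_traces`, `predict_runtime`, the `runtime_s` field) and all data values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def graph_signature(edges, num_tasks):
    """Return (num_tasks, num_edges, depth) for a DAG.

    `edges` holds (parent, child) pairs over anonymized task IDs
    0..num_tasks-1; depth is the number of tasks on the longest path.
    """
    children = defaultdict(list)
    indegree = [0] * num_tasks
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Topological traversal (Kahn's algorithm) that tracks, for each
    # task, the length of the longest path ending at that task.
    level = [1] * num_tasks
    ready = [t for t in range(num_tasks) if indegree[t] == 0]
    while ready:
        task = ready.pop()
        for child in children[task]:
            level[child] = max(level[child], level[task] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    depth = max(level) if num_tasks else 0
    return (num_tasks, len(edges), depth)


def cluster_traces(traces):
    """Group historical runtimes by the structural signature of their graph."""
    clusters = defaultdict(list)
    for trace in traces:
        signature = graph_signature(trace["edges"], trace["num_tasks"])
        clusters[signature].append(trace["runtime_s"])
    return clusters


def predict_runtime(clusters, edges, num_tasks):
    """Predict runtime as the median of the matching cluster, or None."""
    runtimes = clusters.get(graph_signature(edges, num_tasks))
    return median(runtimes) if runtimes else None


# Made-up example traces: two three-task chains and one fan-in graph.
traces = [
    {"num_tasks": 3, "edges": [(0, 1), (1, 2)], "runtime_s": 100.0},
    {"num_tasks": 3, "edges": [(0, 1), (1, 2)], "runtime_s": 120.0},
    {"num_tasks": 3, "edges": [(0, 2), (1, 2)], "runtime_s": 40.0},
]
clusters = cluster_traces(traces)
chain_prediction = predict_runtime(clusters, [(0, 1), (1, 2)], 3)
unseen_prediction = predict_runtime(clusters, [(0, 1)], 2)
```

Exact signature matching keeps the sketch short but is brittle: a workload whose graph differs by a single task lands in no cluster and gets no prediction. A distance-based clustering over richer graph features would generalize better, at the cost of more machinery.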