Philipp Thamm

70.0 DC May 29

Augur: Pre-Execution Energy Prediction for Workflow Tasks in Heterogeneous Clusters

Kathleen West, Vasilis Bountris, Philipp Thamm et al.

Scientific workflows are widely used to process large quantities of data, leading to significant energy consumption and carbon emissions. To reduce this environmental impact, energy- and carbon-aware scheduling approaches could be employed. However, such methods require runtime and energy predictions, which are typically only available for workflows that have been executed previously. Meanwhile, scientists may execute new or modified workflows, use workflows with different input data, or run them on alternative infrastructure. To address this critical gap, we propose Augur, a novel method to predict the energy consumption of scientific workflow tasks prior to execution. By efficiently profiling both the available cluster infrastructure and the workflow at hand, Augur is capable of predicting the overall energy consumption of the workflow with a median prediction error of $16.3\pm15.3\%$ compared to Ichnos, an energy estimation method that uses fitted power models, and $18.2\pm14.7\%$ compared to Intel RAPL, as observed in our experimental evaluation on public and private cloud infrastructure. Relying on only minimal historical execution data, Augur outperforms two state-of-the-art methods in predicting both task runtime and total workflow energy, providing a robust foundation for energy-efficient and carbon-aware scientific data analysis.

30.2 DC May 21

Nf-PEAK: Process-Based Energy Attribution for Nextflow Workflows on Kubernetes Clusters

Philipp Thamm, Somayeh Mohammadi, Kathleen West et al.

Scientific workflows are pipelines of interdependent tasks. They are increasingly executed on shared Kubernetes clusters via workflow engines such as Nextflow. Their energy consumption matters for both cost and sustainability. It is necessary to examine and optimize workflow tasks individually, because they can be very heterogeneous. However, estimating task-level energy on clusters is difficult: Intel RAPL counters report only node-level energy, access to counters and host process information is typically restricted, and concurrent workloads introduce resource contention and measurement noise. We present Nf-PEAK, a containerized method to attribute CPU-package and DRAM energy to individual processes and Nextflow tasks. Nf-PEAK (i) identifies workflow pods, (ii) maps pods to host processes via cgroup metadata, (iii) samples RAPL and per-process performance counters, and (iv) applies a non-linear energy-credit model before aggregating results at task level. On a Kubernetes cluster, we evaluate three nf-core workflows under controlled co-located CPU load. Nf-PEAK reaches an average Mean Absolute Percentage Error of 6.6% in isolated runs and 10.9% when an unrelated workload saturates 8 of 32 hardware threads per node, and remains stable across 2, 3, 4, and 8 nodes. Compared to the state-of-the-art Kubernetes tool Kepler, Nf-PEAK yields lower error on average, particularly under co-located load.
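The attribution and evaluation steps described in this abstract can be illustrated with a minimal sketch. It is not Nf-PEAK's implementation: the abstract names a non-linear energy-credit model but does not give its form, so the sketch substitutes a simple linear split in proportion to CPU time. The function names `attribute_energy` and `mape`, the parameter names, and the sample values are all hypothetical.

```python
from collections import defaultdict


def attribute_energy(node_energy_j, cpu_time_by_pid, task_by_pid):
    """Split one sampling interval's node-level energy across workflow tasks.

    ASSUMPTION: energy is shared linearly in proportion to CPU time.
    The abstract describes a non-linear energy-credit model whose form
    is not given, so this is a stand-in, not the authors' method.
    """
    total_cpu = sum(cpu_time_by_pid.values())
    per_task = defaultdict(float)
    if total_cpu <= 0:
        return dict(per_task)
    for pid, cpu in cpu_time_by_pid.items():
        task = task_by_pid.get(pid)
        if task is None:
            # Process is not part of a workflow pod (e.g. co-located load):
            # its share of the node energy is not credited to any task.
            continue
        per_task[task] += node_energy_j * cpu / total_cpu
    return dict(per_task)


def mape(estimated, reference):
    """Mean Absolute Percentage Error, in percent, over paired values."""
    pairs = list(zip(estimated, reference))
    return 100.0 * sum(abs(e - r) / abs(r) for e, r in pairs) / len(pairs)


# Hypothetical interval: 100 J measured on the node, three processes,
# two of which belong to workflow tasks.
shares = attribute_energy(
    100.0,
    {1: 2.0, 2: 1.0, 3: 1.0},
    {1: "align", 2: "sort"},
)
```

In this sketch the unmapped process still counts toward the total CPU time, so its share of the interval's energy is left unattributed rather than being redistributed to the workflow tasks.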

2 Papers