DC · AI · Dec 23, 2024

Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

arXiv:2412.17484v1 · 5 citations · h-index: 24 · CCGRID
Originality: Incremental advance
AI Analysis

The paper addresses operational costs and energy demands for datacenter operators managing AI workloads, but the advance is incremental because it builds on prior fragmentation-aware scheduling methods.

This work tackles the online scheduling problem in GPU datacenters by proposing PWR, a novel policy that reduces power consumption by selecting power-efficient GPU and CPU combinations. When combined with the existing Fragmentation Gradient Descent (FGD) policy, PWR achieves a balanced trade-off between reducing power usage and minimizing GPU fragmentation in simulated cluster experiments.

The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.
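The idea of scoring candidate nodes by both power and fragmentation can be illustrated with a minimal sketch. It is not the paper's implementation or the Kubernetes plugin API: the `Node` type, the two-level power model (a GPU draws a fixed busy power as soon as any share of it is allocated), the `stranded_share` fragmentation proxy (a stand-in, not FGD's metric), and the weight `alpha` are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical node description; all field names are illustrative."""
    name: str
    gpu_free: Tuple[float, ...]  # free share per GPU, 1.0 means fully idle
    gpu_idle_w: float            # assumed power draw of an idle GPU (watts)
    gpu_busy_w: float            # assumed power draw of a GPU with any allocation


def node_power(node: Node) -> float:
    # Assumed simplified model: idle power if untouched, busy power otherwise.
    return sum(node.gpu_idle_w if f >= 1.0 else node.gpu_busy_w
               for f in node.gpu_free)


def place(node: Node, demand: float) -> Optional[Node]:
    # Best fit: use the GPU with the least free share that still fits the task.
    fitting = [i for i, f in enumerate(node.gpu_free) if f >= demand]
    if not fitting:
        return None
    i = min(fitting, key=lambda j: node.gpu_free[j])
    free = list(node.gpu_free)
    free[i] = round(free[i] - demand, 6)
    return replace(node, gpu_free=tuple(free))


def stranded_share(node: Node) -> float:
    # Stand-in fragmentation proxy: free capacity left on partially used GPUs.
    return sum(f for f in node.gpu_free if 0.0 < f < 1.0)


def pick_node(nodes: List[Node], demand: float, alpha: float = 0.5) -> Optional[str]:
    # Lower cost is better; alpha weights power against fragmentation.
    candidates = []
    for n in nodes:
        after = place(n, demand)
        if after is None:
            continue
        d_power = node_power(after) - node_power(n)
        d_frag = stranded_share(after) - stranded_share(n)
        candidates.append((n.name, d_power, d_frag))
    if not candidates:
        return None
    # Normalize each term by its largest magnitude so the weights are comparable.
    max_p = max(abs(c[1]) for c in candidates) or 1.0
    max_f = max(abs(c[2]) for c in candidates) or 1.0
    best = min(candidates,
               key=lambda c: alpha * c[1] / max_p + (1 - alpha) * c[2] / max_f)
    return best[0]


nodes = [
    Node("a", gpu_free=(1.0, 1.0), gpu_idle_w=50.0, gpu_busy_w=300.0),
    Node("b", gpu_free=(0.5, 1.0), gpu_idle_w=50.0, gpu_busy_w=300.0),
]
print(pick_node(nodes, demand=0.5))
```

In this example a half-GPU task goes to node `b`: filling the already active GPU adds no power under the assumed model and removes a stranded half share, whereas node `a` would have to wake an idle GPU. Setting `alpha` to 1.0 or 0.0 reduces the sketch to a power-only or fragmentation-only choice.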

Code Implementations: 1 repo
