DC, LG | Jan 9, 2025

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters

arXiv:2501.05563v1 | 1 citation | h-index: 74 | INFOCOM
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of scheduling deep learning workloads in GPU clusters to improve resource utilization and training efficiency. It represents an incremental improvement over existing scheduling methods.

The paper tackles efficient job scheduling for distributed deep learning training with mixed parallelisms in GPU clusters. It proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) algorithm that minimizes inter-server communication overhead and achieves theoretically provable competitive scheduling efficiency.

The recent explosive growth of deep learning (DL) models has created a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a novel prediction-assisted online scheduling approach designed to mitigate the challenges associated with DL cluster scheduling. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models and their associated distributed training configurations, A-SRPT strategically assigns jobs to the available GPUs, thereby minimizing inter-server communication overhead. Observing that most DDLwMP jobs recur, A-SRPT incorporates a random forest regression model to predict training iterations. Crucially, A-SRPT maps the complex scheduling problem into a single-machine instance, which is solved optimally by a preemptive "shortest-remaining-processing-time-first" strategy. This optimal solution serves as a guide for actual job scheduling within the GPU clusters, leading to a theoretically provable competitive scheduling efficiency. We conduct extensive real-world testbed and simulation experiments to verify our proposed algorithms.
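To make the single-machine step concrete, here is a minimal sketch of classic preemptive shortest-remaining-processing-time-first scheduling on one machine. It is not the authors' A-SRPT: it assumes each job's processing time is already known (in the paper, predicted training iterations would feed this estimate), and the function name `srpt_schedule` and the `(job_id, release_time, processing_time)` tuple layout are hypothetical choices made for illustration.

```python
import heapq


def srpt_schedule(jobs):
    """Simulate preemptive SRPT on a single machine.

    jobs: list of (job_id, release_time, processing_time) tuples.
    Returns a dict mapping job_id to its completion time.
    """
    pending = sorted(jobs, key=lambda j: j[1])  # order by release time
    ready = []        # min-heap of (remaining_time, job_id)
    completion = {}
    now = 0
    i = 0             # index of the next job that has not yet arrived

    while i < len(pending) or ready:
        if not ready:
            # Machine is idle: jump ahead to the next arrival.
            now = max(now, pending[i][1])

        # Admit every job released up to the current time.
        while i < len(pending) and pending[i][1] <= now:
            job_id, _, proc = pending[i]
            heapq.heappush(ready, (proc, job_id))
            i += 1

        # Run the job with the least remaining work.
        remaining, job_id = heapq.heappop(ready)

        # Run until it finishes or the next job arrives (possible preemption).
        next_release = pending[i][1] if i < len(pending) else float("inf")
        run = min(remaining, next_release - now)
        now += run
        remaining -= run

        if remaining > 0:
            heapq.heappush(ready, (remaining, job_id))
        else:
            completion[job_id] = now

    return completion
```

For example, `srpt_schedule([("A", 0, 5), ("B", 1, 2), ("C", 2, 1)])` preempts the long job A when the shorter job B arrives, so B and C finish before A. The simulation only re-evaluates priorities at arrivals and completions, which is sufficient for SRPT because remaining times change order only at those events.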

