PF DC LG Jan 29, 2024

GPU Cluster Scheduling for Network-Sensitive Deep Learning

arXiv:2401.16492v2, 7 citations, h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient resource allocation in distributed deep learning clusters, particularly under network congestion, offering significant performance gains for data centers and cloud providers.

The paper proposes a GPU-cluster scheduler for distributed deep learning that consolidates GPU resources according to each job's sensitivity to network delays, improving end-to-end makespan by up to 69%, reducing average job completion time by up to 83%, and cutting communication overheads by up to 98% under congested network conditions.

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves end-to-end makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
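
To make component (i) concrete, the following is a minimal Python sketch of classical delay scheduling applied to GPU consolidation. It is not the paper's implementation: the names `Job`, `try_place`, `machine_timer`, and `rack_timer`, the three-level machine/rack/cluster hierarchy, and the idea of giving each job its own timers are all assumptions made for illustration. A job is placed on a single machine if possible; only after it has waited past its machine timer may it spread across one rack, and only after its rack timer may it spread across the whole cluster.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int             # number of GPUs requested
    machine_timer: float  # hypothetical: wait this long for a single-machine slot
    rack_timer: float     # hypothetical: wait this long for a single-rack slot
    waited: float = 0.0   # time the job has spent in the queue so far


def _take(needed, slots):
    """Greedily take `needed` GPUs from (location, free_count) pairs."""
    alloc = {}
    for location, free in slots:
        if needed == 0:
            break
        if free > 0:
            grab = min(free, needed)
            alloc[location] = grab
            needed -= grab
    return alloc


def try_place(job, free_gpus):
    """Return (level, allocation) or None if the job should keep waiting.

    free_gpus maps rack -> machine -> number of free GPUs.
    """
    # Level 1: the tightest consolidation, all GPUs on one machine.
    for rack, machines in free_gpus.items():
        for machine, free in machines.items():
            if free >= job.gpus:
                return "machine", {(rack, machine): job.gpus}

    # Level 2: relax to one rack only after the machine timer has expired.
    if job.waited >= job.machine_timer:
        for rack, machines in free_gpus.items():
            if sum(machines.values()) >= job.gpus:
                slots = [((rack, m), n) for m, n in machines.items()]
                return "rack", _take(job.gpus, slots)

    # Level 3: relax to the whole cluster only after the rack timer has expired.
    if job.waited >= job.rack_timer:
        slots = [((rack, m), n)
                 for rack, machines in free_gpus.items()
                 for m, n in machines.items()]
        if sum(n for _, n in slots) >= job.gpus:
            return "cluster", _take(job.gpus, slots)

    return None  # keep waiting for a better-consolidated placement


# Example cluster: rack r0 has two machines with 2 free GPUs each, r1 has one.
cluster = {"r0": {"m0": 2, "m1": 2}, "r1": {"m2": 2}}
job = Job("resnet", gpus=4, machine_timer=10.0, rack_timer=30.0)
first_attempt = try_place(job, cluster)   # no machine has 4 free GPUs yet
job.waited = 10.0
second_attempt = try_place(job, cluster)  # machine timer expired, rack r0 fits
```

In this sketch a longer timer means the job holds out longer for a tightly consolidated placement, which is the knob the abstract's "auto-tuner" is described as optimizing; how the paper derives timer values from network sensitivity is not specified here.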
