DC · AI · Dec 27, 2021

Automatic Configuration for Optimal Communication Scheduling in DNN Training

arXiv:2112.13509v1
Originality: Incremental advance
AI Analysis

This work addresses communication inefficiencies in distributed DNN training, offering a dynamic solution for improved training speed, though it is incremental as it builds on the existing ByteScheduler framework.

The paper tackles suboptimal communication scheduling in distributed DNN training caused by static hyper-parameter configurations. It presents AutoByte, a real-time method that dynamically tunes the partition size and the credit size and delivers up to 33.2% higher performance than the best static configuration in ByteScheduler.

ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of its hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler uses Bayesian Optimization (BO) to find the optimal hyper-parameter configuration beforehand. In practice, however, various runtime factors (e.g., worker node status and network conditions) change over time, making the statically determined, one-shot configuration suboptimal for real-world DNN training. To address this problem, we present a real-time configuration method (called AutoByte) that automatically searches for the optimal hyper-parameters in a timely manner as the training system changes. AutoByte extends the ByteScheduler framework with a meta-network, which takes the system's runtime statistics as its input and outputs predictions of the speedups under specific configurations. Evaluation results on various DNN models show that AutoByte can dynamically tune the hyper-parameters with low resource usage and deliver up to 33.2% higher performance than the best static configuration in ByteScheduler.
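To make the two hyper-parameters concrete, here is a minimal Python sketch of what a partition size and a credit size could control. It is an illustration under stated assumptions, not ByteScheduler's or AutoByte's actual code. The names `Chunk`, `partition`, `schedule`, and `choose_config` are hypothetical. The sketch assumes that chunks of layers with a lower index are sent first, and it simplifies the credit to a byte budget per send window. The `predict_speedup` callable merely stands in for the meta-network described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    layer: int   # layer index; lower is assumed to mean "send sooner"
    offset: int  # byte offset of this chunk inside its tensor
    size: int    # chunk size in bytes


def partition(layer: int, tensor_bytes: int, partition_size: int) -> List[Chunk]:
    """Split one tensor into chunks of at most partition_size bytes."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    return [
        Chunk(layer, off, min(partition_size, tensor_bytes - off))
        for off in range(0, tensor_bytes, partition_size)
    ]


def schedule(
    tensors: Sequence[Tuple[int, int]],  # (layer index, tensor size in bytes)
    partition_size: int,
    credit_size: int,
) -> List[List[Chunk]]:
    """Order chunks by priority and group them into send windows.

    Simplification: a window is closed as soon as adding the next chunk
    would exceed credit_size bytes. A window always holds at least one chunk.
    """
    chunks = [
        chunk
        for layer, nbytes in tensors
        for chunk in partition(layer, nbytes, partition_size)
    ]
    chunks.sort(key=lambda c: (c.layer, c.offset))  # rearrangement step

    windows: List[List[Chunk]] = []
    current: List[Chunk] = []
    used = 0
    for chunk in chunks:
        if current and used + chunk.size > credit_size:
            windows.append(current)
            current, used = [], 0
        current.append(chunk)
        used += chunk.size
    if current:
        windows.append(current)
    return windows


Config = Tuple[int, int]  # (partition_size, credit_size)


def choose_config(
    candidates: Iterable[Config],
    predict_speedup: Callable[[Config], float],
) -> Config:
    """Pick the candidate with the highest predicted speedup.

    predict_speedup is a placeholder for a learned predictor that would
    also take runtime statistics into account.
    """
    return max(candidates, key=predict_speedup)


if __name__ == "__main__":
    # Tensors arrive in backward-pass order (layer 1 before layer 0).
    for window in schedule([(1, 10), (0, 5)], partition_size=4, credit_size=8):
        print([(c.layer, c.offset, c.size) for c in window])
```

In this toy model, a smaller partition size lets high-priority chunks preempt large tensors sooner at the cost of more messages, while the credit size bounds how much data is handed to the network at once. Re-running `choose_config` whenever the runtime statistics change is the kind of dynamic tuning the abstract describes, in contrast to fixing one configuration before training.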

