LG | Aug 30, 2022

Analysis of Distributed Deep Learning in the Cloud

arXiv:2208.14344v3 | 5 citations | h-index: 85
Originality: Incremental advance
AI Analysis

This work addresses cost and performance optimization for users running distributed deep learning in cloud environments, though it is incremental as it extends prior profiling methods.

The paper addresses inefficiencies in distributed deep learning on public clouds by introducing a profiler that identifies communication stalls. Its measurements show that more expensive GPU instances are not always the most performant: intra-machine interconnects can add communication overhead of up to 90% of DNN training time, and network-connected instances can be up to 5x slower than training on a single instance.

We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls - interconnect and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings for users to make an informed decision. We observe that the more expensive GPU instances may not be the most performant for all DNN models and AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads up to 90% of DNN training time and network-connected instances can suffer from up to 5x slowdown compared to training on a single instance. Further, we model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls. Finally, we propose a measurement-based recommendation model for users to lower their public cloud monetary costs for DDL, given a time budget.
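The recommendation step mentioned at the end of the abstract can be pictured as a constrained selection: among measured configurations, keep those that finish within the time budget and pick the cheapest. The sketch below is only an illustration of that idea, not the authors' model; the `Measurement` fields, the `recommend` function, and the configuration names and numbers are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Measurement:
    """One profiled configuration (all fields are hypothetical)."""
    name: str
    hourly_cost_usd: float   # price of the instance(s) per hour
    epoch_time_hours: float  # measured wall-clock time per epoch, stalls included

    def training_time(self, epochs: int) -> float:
        return self.epoch_time_hours * epochs

    def training_cost(self, epochs: int) -> float:
        return self.hourly_cost_usd * self.training_time(epochs)


def recommend(measurements: Iterable[Measurement],
              epochs: int,
              time_budget_hours: float) -> Optional[Measurement]:
    """Return the cheapest configuration that meets the time budget, or None."""
    feasible = [m for m in measurements
                if m.training_time(epochs) <= time_budget_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m.training_cost(epochs))


# Placeholder numbers for illustration only; they are not results from the paper.
candidates = [
    Measurement("config_a", hourly_cost_usd=1.0, epoch_time_hours=2.0),
    Measurement("config_b", hourly_cost_usd=4.0, epoch_time_hours=0.4),
    Measurement("config_c", hourly_cost_usd=10.0, epoch_time_hours=0.3),
]

choice = recommend(candidates, epochs=10, time_budget_hours=5.0)
print(choice.name if choice else "no configuration meets the budget")
```

With these placeholder values the most expensive configuration is the fastest but not the cheapest feasible one, which mirrors the abstract's point that price alone is a poor guide to the best instance choice.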

