DC | LG | PF | Aug 18, 2020

Benchmarking network fabrics for data distributed training of deep neural networks

arXiv:2008.08057v1 | 5 citations
AI Analysis

This work addresses performance concerns of researchers and engineers who use data parallel training in HPC environments. It is incremental, however, because it benchmarks existing technologies without introducing new methods.

The paper examined how different network fabrics and software primitives affect data distributed deep learning, finding that Ethernet-based networking in shared HPC systems does not significantly impact training times for common deep neural networks or traditional HPC applications.

Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
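The gradient exchange described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's benchmark code: it uses no MPI, NCCL, or GPUDirect, and the names `local_gradient`, `allreduce_mean`, and `data_parallel_step` are hypothetical helpers chosen for illustration. It assumes a one-parameter linear model and equally sized shards, so that averaging the per-worker gradients reproduces the full-batch gradient.

```python
# Simulated data-parallel training on a single process.
# Assumption: a Python list of per-worker values stands in for the
# network fabric; a real system would call an MPI or NCCL allreduce.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for allreduce: every worker receives the same mean value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients, allreduce, identical updates."""
    grads = [local_gradient(w, shard) for shard in shards]
    synced = allreduce_mean(grads)
    # Each replica applies the same averaged gradient, so all copies of
    # the weight remain identical after the update.
    return [w - lr * g for g in synced]


# Toy dataset following y = 3x, split evenly across four simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w = 0.0
for _ in range(200):
    replicas = data_parallel_step(w, shards, lr=0.01)
    w = replicas[0]

print(f"learned weight: {w:.6f}")
```

The allreduce is the only step that crosses the network in this scheme, which is why interconnect and collective-library choices are the variables being benchmarked. With unequal shard sizes, a plain mean of per-worker gradients would no longer equal the full-batch gradient and would need weighting by shard size.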

