DC, LG · Jan 24, 2018

On Scale-out Deep Learning Training for Cloud and HPC

Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey

arXiv:1801.08030v1 · 11.331 citations

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for faster training of large deep neural networks, which is critical for AI researchers and practitioners, though the contribution appears incremental because it builds on existing distributed training methods.

The paper tackles the challenge of scaling synchronous Stochastic Gradient Descent for deep learning training across hundreds to thousands of nodes in cloud and HPC systems, presenting the Intel MLSL library with proof-points demonstrating efficient scaling.

The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) is still a challenging problem and requires continued research/development. This entails innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) and present proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems.
