LG | DC | NE | Feb 14, 2018

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

arXiv:1802.04924v2 | 130 citations
Originality: Highly original
AI Analysis

This paper addresses inefficiencies in distributed deep learning training that affect both researchers and practitioners. Its optimization approach is novel rather than incremental, and it delivers concrete gains in training throughput, communication cost, and multi-GPU scalability.

The paper tackles suboptimal runtime performance in large-scale distributed training of convolutional neural networks. It proposes layer-wise parallelism, which lets each layer use its own parallelization strategy. The reported results are higher training throughput, lower communication costs, and better scalability to multiple GPUs, with no loss of accuracy.

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, these approaches result in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different parallelization strategies. In this paper, we propose layer-wise parallelism that allows each layer in a network to use an individual parallelization strategy. We jointly optimize how each layer is parallelized by solving a graph search problem. Our evaluation shows that layer-wise parallelism outperforms state-of-the-art approaches by increasing training throughput, reducing communication costs, and achieving better scalability to multiple GPUs, while maintaining original network accuracy.
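To make the idea of jointly choosing a strategy per layer more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes the network is a simple chain of layers, so a dynamic program over the chain is enough, whereas the abstract describes a search over a general graph. The function name `best_layerwise_strategy`, the strategy labels, and every cost value are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Strategy = str


def best_layerwise_strategy(
    compute_cost: List[Dict[Strategy, float]],
    transfer_cost: Dict[Tuple[Strategy, Strategy], float],
) -> Tuple[float, List[Strategy]]:
    """Return (minimum total cost, per-layer strategies) for a chain of layers.

    compute_cost[i][s] is the (hypothetical) cost of running layer i with
    strategy s; transfer_cost[(p, s)] is the (hypothetical) cost of moving
    data between adjacent layers that use strategies p and s.
    """
    # best[s] = (cheapest cost of the layers seen so far with the last layer
    #            using strategy s, the assignment that achieves it)
    best = {s: (c, [s]) for s, c in compute_cost[0].items()}
    for layer in compute_cost[1:]:
        new_best = {}
        for s, c in layer.items():
            # Pick the predecessor strategy that minimizes cost so far plus
            # the transfer cost into strategy s.
            prev_cost, prev_plan = min(
                ((best[p][0] + transfer_cost[(p, s)], best[p][1]) for p in best),
                key=lambda t: t[0],
            )
            new_best[s] = (prev_cost + c, prev_plan + [s])
        best = new_best
    return min(best.values(), key=lambda t: t[0])


# Hypothetical costs: two convolutional layers that favor data parallelism,
# followed by a fully connected layer that favors model parallelism.
layers = [
    {"data": 1.0, "model": 3.0},  # conv1
    {"data": 1.0, "model": 3.0},  # conv2
    {"data": 5.0, "model": 1.0},  # fc
]
transfers = {
    ("data", "data"): 0.0,
    ("model", "model"): 0.0,
    ("data", "model"): 1.0,
    ("model", "data"): 1.0,
}

cost, plan = best_layerwise_strategy(layers, transfers)
print(cost, plan)
```

With these made-up numbers, using one strategy for every layer costs 7.0 either way, while the mixed plan pays one transfer penalty and still comes out cheaper. That is the intuition behind letting layers differ, under the stated assumptions.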

