ML DC LGOct 11, 2019

Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

Igor Colin, Ludovic Dos Santos, Kevin Scaman

arXiv:1910.05104v15.75 citations

Originality Incremental advance

AI Analysis

This work addresses optimization bottlenecks in distributed deep learning, offering theoretical insights and practical algorithms for accelerating training in pipeline-parallel setups, though it is incremental in extending known methods to this specific context.

The paper investigates the theoretical limits of pipeline parallel optimization for deep learning, providing matching lower and upper complexity bounds for smooth convex and non-convex functions, and introduces a novel algorithm (PPRS) for non-smooth convex functions that achieves near-linear speed-up with a depth-dependent acceleration, empirically showing it outperforms traditional methods in limited-sample scenarios like few-shot learning.

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension. While the convergence rate still obeys a slow $\varepsilon^{-2}$ convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent for problems where the sample size is limited, such as few-shot or adversarial learning.

View on arXiv PDF

Similar